Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘biopython

Biopython projects for Google Summer of Code 2010

with 5 comments

Google Summer of Code provides the unique opportunity for students to spend a summer working on open source projects and getting paid. Biopython was involved with two great projects last summer, and it’s time to apply for this year’s program: the student application period is from next Monday, March 29th to Friday, April 9th, 2010.

If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:

  • NESCent Phyloinformatics — NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.
  • Open Bioinformatics Foundation — The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.

This year, I’ve collaborated on three project ideas centering around the idea of tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts out the implementation details, instead allowing you to focus on the high level biological questions. Bradford Cross, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in an interview at Data Wrangling.

These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:

All involve taking two or more different toolkits and combining the functionality into a higher level interface focused around ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on functionality most relevant to their biological questions. Ideally the work would be both a publicly available resource, and contribute directly to the student’s daily research.

If you’re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It’s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.

Written by Brad Chapman

March 26, 2010 at 8:02 am

Posted in OpenBio

Tagged with , ,

Python query interface to BioGateway SPARQL endpoint and InterMine

with one comment

I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meet up of open source coders organized by Toshiaki Katayama and the great folks at the Database Center of Life Sciences (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.

The technical focus was on RDF (Resource Description Format), a standard way to represent data as a triple; each triple is made up of a subject defining the item, a predicate specifying a relationship, and a object representing the linked data. By providing data in RDF along with common naming schemes for common objects we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.

RDF is managed in data stores like Virtuoso, which are equivalent to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy to use Python library to query RDF stores through SPARQL.

Interface design

The python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface simplifies the process of making a query to two steps:

  • Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
  • Pass the query builder to a retrieval server which executes the query and returns the results in a tabular format, as a numpy RecordArray.

The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.

Two interfaces were implemented at BioHackathon 2010:

  • BioGateway — a SPARQL endpoint to a RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
  • InterMine — a XML based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.

By providing a common interface for both semantic data and more traditional data sources, we hope to facilitate the conversion by data providers to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing resources on developing their biological queries.


BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.

This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:

from systemsbio import Biogateway, UniProtGOQueryBuilder
builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
builder.add_attributes(["protein_name", "interactor", "gene_name"])
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_name'], result['gene_name'], \
      result['interactor'], result['GO_term']

An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:

from systemsbio import Biogateway, ReferenceBuilder
builder = ReferenceBuilder()
builder.add_filter("protein_id", "1433B_HUMAN")
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_id'], result['reference']

Using the associated PubMed IDs, we can retrieve the paper details using Biopython and NCBI Entrez, utilizing BioGateway as part of an automated analysis pipeline:

from Bio import Entrez = ""
pubmed_id = result['reference'].replace("PMID_", "")
handle = Entrez.esummary(db="pubmed", id=pubmed_id)
record =[0]
print record['Title']
print record['PubDate']
print ",".join(record['AuthorList'])
print record['FullJournalName'], record['Volume'], record['Pages']
# Novel raf kinase protein-protein interactions found by an exhaustive yeast two-hybrid analysis.
# 2003 Feb
# Yuryev A,Wennogle LP
# Genomics 81 112-25

The full source code is available from GitHub: The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLwrapper to interact with the remote server and parse the results.


InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:

  • Templates — Pre-defined queries that capture common ways biologists search the database.

  • Query builder — A graphical interface to define custom queries, allowing manual discovery of attributes of interest.

We access InterMine programatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4Me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:

from intermine import Intermine, SubmissionQueryBuilder
builder = SubmissionQueryBuilder()
    "submission_title", "developmental_stage"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.add_filter("antibody_name", "H3K4me3")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'submission_title', 'developmental_stage')
# [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
# ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
# 'Early Stage Embryos')
# ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
# 'Early Stage Embryos')]

An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:

from intermine import Intermine, ExperimentQueryBuilder
builder = ExperimentQueryBuilder()
builder.add_attributes(["submission_id", "experiment_name"])
builder.add_filter("organism", "Caenorhabditis elegans")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'experiment_name')
# [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ...

The full source code is available from GitHub: It builds a XML query for submission to InterMine Web Services, handling submitting and parsing the result.


It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day to day summaries:

You are invited to fork and extend the code on GitHub.

Written by Brad Chapman

February 15, 2010 at 11:33 am

Automated retrieval of expression data with Python and R

with 10 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = ""
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """ = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results =
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary =
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    robjects.r.assign("", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
      gsm <- getGEO(
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    robjects.r.assign("", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
      gpl <- getGEO( <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab') # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Trimming adaptors from short read sequences

with 11 comments

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    if right_side:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Tagged with , , ,

Biopython projects for Google Summer of Code

leave a comment »

Google Summer of Code is a brilliant program that provides a summer stipend for students to work on open source projects. This year, I have been evaluating Biopython student proposals under the watchful eye of The National Evolutionary Synthesis Center (NESCent). Biopython was lucky enough to have interest from several students, resulting in some excellent proposals. Two of these were selected for funding:

  • Eric Talevich’s proposal on adding support to Biopython for PhyloXML, an XML format for representing phylogenetic trees and associated data. I will be the primary mentor for this project, with Christian Zmasek, the author of PhyloXML, providing plenty of capable secondary mentoring.
  • Nick Matzke’s proposal on Biogeographical Phylogenetics. This is an integrated project that involves extracting biodiversity data from the web, merging it with phylogenetic data, and providing analysis and display libraries. Nick helped assemble a full team of mentors, and I will be providing input on integrating his code with Biopython while learning about Biogeography along the way.

This is my first year working with summer of code, and I could not be more impressed by the professionalism and hard work of the organizers at both Google and NESCent. The program is designed from the ground up to be simultaneously inclusive and rigorously selective. Students are asked to prepare detailed project plans to demonstrate they understand the subject, are able programmers, and can communicate effectively. There is also a strong eye given to students who are likely to continue on in their open source communities after the summer; after all, open source work is really built on the free labor of many dedicated people.

Having worked on open source projects for several years, it is invigorating and heartening to see a program designed to encourage the next generation of open source leaders. My time on the program thus far has been a great learning experience, and I am looking forward to a productive summer.

Written by Brad Chapman

April 20, 2009 at 5:38 pm

Posted in OpenBio

Tagged with , ,

Examining and adjusting your GFF file

with 2 comments

Generic Feature Format (GFF) defines a standard template for representing biological features. Within this template, however, is room for flexibility. Different GFF producers may decide to format their data in slightly different ways. While parsing these files is not a problem, correctly interpreting and utilizing the data can be. The in-development Biopython GFF parser provides utilities to get a high level summary of the elements of a GFF file, and to adjust line items in the file during parsing.

Examining a GFF file

When downloading a new GFF file, the first step is getting an overview of the file contents. The GFF parser provides a GFFExaminer class to help with this. The first function of interest is available_limits:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)

It returns a dictionary defining various ways to limit your parsing of the GFF file, along with the count of each item. As an example, here is a trimmed dump from one of the test files. This has features on two different record ids — chromosomes ‘I’ and ‘X’, with 159 and 6 items, respectively. Also listed are the combination of the 2nd source column and 3rd type column, and the type column by itself:

{'gff_id': {('I',): 159,
            ('X',): 6},
 'gff_source_type': {('Coding_transcript', 'CDS'): 27,
                     ('Coding_transcript', 'exon'): 33,
                     ('Coding_transcript', 'five_prime_UTR'): 4,
                     ('Coding_transcript', 'gene'): 2,
                     ('Coding_transcript', 'intron'): 29,
                     ('Coding_transcript', 'mRNA'): 4,
                     ('Coding_transcript', 'three_prime_UTR'): 3,
                     ('mass_spec_genome', 'translated_nucleotide_match'): 7},
 'gff_type': {('CDS',): 57,
              ('exon',): 33,
              ('five_prime_UTR',): 4,
              ('gene',): 2,
              ('intron',): 29,
              ('mRNA',): 4,
              ('three_prime_UTR',): 3,
              ('translated_nucleotide_match',): 7}}

In addition to the overview of file contents, nested relationships are another important component of the file to understand. A summary of these is available through the parent_child_map function:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)

Again, a dictionary is returned. The keys in the dictionary are parent source and type elements in the file, while the values are children of those elements. For instance, here is the dictionary for a three tiered relationship where genes have mRNAs, and each mRNA can have coding regions, exons, introns, and 5′ and 3′ untranslated regions:

{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}

Adjusting GFF lines during parsing

Occasionally you may run into a file format that you can comprehend, but that does not match your expectations for where items should be. SOLiD GFF files are one example; many thanks are due to David Schruth, who has been patiently working with me on the parsing of these files. They have the read name in the first column, which is normally used for the record ID the feature maps to. The actual record ID is contained as an attribute, i=1, where the 1 corresponds to the index of record it maps to in the original FASTA alignment file:

3_336_815_F3    solid   read    55409   55428   10.4    +       .       i=1

The GFFAddingIterator has an optional attribute, line_adjust_fn, which can be used to solve this problem. The function is called each time a line is read and passed a parsed dictionary representing the line. The dictionary for the above line is:

{'id': '',
 'is_gff2': False,
 'location': [55408, 55428],
 'quals': {
           'i': ['1'],
           'score': ['10.4'],
           'source': ['solid']},
 'rec_id': '3_336_815_F3',
 'strand': 1,
 'type': 'read'}

The function takes this item as an argument and returns the dictionary, but with any adjustments that need to be made. In our example, we look up the name of the record corresponding to the i=1 index. This name is swapped in for the rec_id, while that information moves to an attribute named read_name:

from Bio import SeqIO
from BCBio.GFF.GFFParser import GFFAddingIterator

class SolidFastaRemap:
    def __init__(self, initial_fasta):
        self._in_map = self._get_index_map(initial_fasta)

    def _get_index_map(self, in_file):
        in_map = dict()
        in_handle = open(in_file)
        for index, rec in enumerate(SeqIO.parse(in_handle, "fasta")):
            in_map[index] =
        return in_map

    def adjust_fn(self, results):
        # 1-based indexes; convert to 0-based
        rec_index = int(results['quals']['i'][0]) - 1
        read_name = results['rec_id']
        results['quals']['read_name'] = [read_name]
        results['rec_id'] = self._in_map[rec_index]
        return results

remapper = SolidFastaRemap(fasta_file)
gff_iterator = GFFAddingIterator(line_adjust_fn=remapper.adjust_fn)
rec_dict = gff_iterator.get_all_features(gff_file)

This allows you to fix the file during the parsing, saving multiple passes through the file. This general functionality can be used to cleanly deal with any other inconsistencies that crop up during your GFF parsing adventures.

Written by Brad Chapman

April 12, 2009 at 8:55 pm

Posted in OpenBio

Tagged with ,

More python GFF parsing — iterative parsing and GFF2 nested features

leave a comment »

Work on the python generic feature format (GFF) parser continues to push forward; many thanks to those who have provided feedback in helping to refine the functionality. Previously, we discussed the initial implementation, introduced MapReduce parsing for parallelization, and discussed deploying on a cluster and GFF2 parsing. This week describes an interface for iterator based parsing of GFF files and nested features for GFF2 files.

Iterative parsing of GFF

GFF files are line-based and related features can be located anywhere in the file. To guarantee all features are parsed and combined correctly, the entire file needs to be scanned and loaded. For large files, we require strategies to load the data without abusing all available memory.

In some cases, it is known that parsing the file in chunks will not result in any information being lost. GFF files produced by SOLiD for short read alignments are one common case. These are read based, non-nested, and quite large. To tackle these files, the parser now has an iterator based interface that can be used to iterate over sections of the file:

from BCBio.GFF.GFFParser import GFFAddingIterator

gff_iterator = GFFAddingIterator()
for rec_dict in gff_iterator.get_features(gff_file, target_lines=3000000):
    for rec in rec_dict.values():
        # deal with rec.features    

This parses a file into ~350MB sized pieces, returning a dictionary of Biopython SeqRecord objects keyed by their names. Each SeqRecord contains all of the features added from that chunk of the file. These can be persisted to a database or otherwise analyzed before proceeding on to the next chunk, keeping memory requirements more reasonable.

The file can still be filtered by feature type, allowing extraction of only features of interest. This example uses the Biopython SeqIO interface to parse an initial FASTA file with our sequences, and then adds coding sequences on chromosome one to it in chunks of 1 million lines:

from BCBio.GFF.GFFParser import GFFAddingIterator
from Bio import SeqIO

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
for rec_dict in gff_iterator.get_features(gff_file,
        limit_info=cds_limit_info, target_lines=1000000):
    for rec in rec_dict.values():
        # deal with rec.features    

To avoid missing nested features, the iterator makes smart decisions about when to break the file. It is broken at points where all child features have their parents and can be expected to be nested correctly.

Nested features for GFF2

Nesting of features is handled nicely in the new GFF3 format. However, many sources provide information in the older GFF2 (also called GTF) format, which has a variety of nesting schemes. The test examples for the parser contain some examples of these from different online repositories:

  • Ensembl GFF2 — recognizes child features by a transcript_id attribute, and does not provide a parent feature
  • WormBase GFF2 — child features have a Transcript attribute for certain feature types; a parent feature is present, also with a Transcript attribute
  • JGI GFF2 — child features have a TranscriptId or ProteinId attribute and no parent feature

The updated parser handles all these styles of nesting, building a top level feature for those files where this parent is not present. This mimics the new GFF3 behavior to ease the transition to those files. Where parent features are missing, a new feature is created of type inferred_parent which spans the distance of all child features. These child features are available from the sub_features attribute of the parent.

These new updates improve parsing for older GFF2 files which are still widely used, and opens up parsing to new GFF files produced from SOLiD machines. The code is available from the standard github location. Please continue to pass along bug reports and suggestions.

Written by Brad Chapman

April 5, 2009 at 7:49 pm

Posted in OpenBio

Tagged with , ,

Python GFF parser update — parallel parsing and GFF2

with 9 comments

Parallel parsing

Last week we discussed refactoring the Python GFF parser to use a MapReduce framework. This was designed with the idea of being able to scale GFF parsing as file size increases. In addition to large files describing genome annotations, GFF is spreading to next-generation sequencing; SOLiD provides a tool to convert their mapping files to GFF.

Parallel processing introduces overhead due to software intermediates and networking costs. For the Disco implementation of GFF parsing, parsed lines run through Erlang and are translated to and from JSON strings. Invoking this overhead is worthwhile only if enough processors are utilized to overcome the slowdown. To estimate when we should start to parallelize, I looked at parsing a 1.5GB GFF file on a small multi-core machine and a remote cluster. Based on rough testing and non-scientific linear extrapolation of the results, I estimate 8 processors are needed to start to see a speed-up over local processing.

The starting baseline for parsing our 1.5GB file is one and half minutes using a single processor on my commodity Dell desktop. This desktop has 4 cores, and running Disco utilizing all 4 CPUs, the time increases to 3 minutes. Once Disco itself has been set up, switching between the two is seamless since the file is parsed in shared memory.

The advantage of utilizing Disco is that it can scale from this local implementation to very large clusters. Amazon’s Elastic Computing Cloud (EC2) is an amazing resource where you can quickly set up and run jobs on powerful hardware. It is essentially an instant on-demand cluster for running applications. Using the ElasticFox Firefox plugin and the setup directions for Disco on EC2, I was able to quickly test GFF parsing on a test cluster of three small (AMI ami-cfbc58a6, a Debian 5.0 Lenny instance) instances. For distributed jobs, the main challenges are setting up each of the cluster nodes with the software, and distributing the files across the nodes. Disco provides scripts to install itself across the cluster and to distribute the file being parsed. When you are attacking a GFF parsing job that is prohibitively slow or memory intensive on your local hardware, a small cluster of a few extra-large of extra-large high CPU instances on EC2 will help you overcome these limitations. Hopefully in the future Disco will become available on some standard Amazon machine images, lowering the threshold to getting a job running.

In practical terms, local GFF parsing will be fine for most standard files. When you are limited by parsing time with large files, attack the problem using either a local cluster or EC2 with 8 or more processors. To better utilize a small number of local CPUs, it makes sense to explore a light weight solution such as the new python multiprocessing module.

GFF2 support

The initial target for GFF parsing was the GFF3 standard. However, many genome centers still use the older GFF2 or GTF formats. The main parsing difference between these formats are the attributes. In GFF3, they look like:


while in GFF2 they are less standardized, and look like:

  Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2"

The parser has been updated to handle GFF2 attributes correctly, with test cases from several genome centers. In practice, there are several tricky implementations of the GFF2 specifications; if you find examples of incorrectly parsed attributes by the current parser, please pass them along.

GFF2 and GFF3 also differ in how nested features are handled. A standard example of nesting is specifying the coding regions of a transcript. Since GFF2 didn’t provide a default way to do this, there are several different methods used in practice. Currently, the parser leaves these GFF2 features as flat and you would need to write custom code on top of the parser to nest them if desired.

The latest version of the GFF parsing code is available from GitHub. To install it, click the download link on that page and you will get the whole directory along with a file to install it. It installs outside of Biopython since it is still under development. As always, I am happy to accept any contributions or suggestions.

Written by Brad Chapman

March 29, 2009 at 10:49 am

MapReduce implementation of GFF parsing for Biopython

with 4 comments

I previously wrote up details about starting a GFF parser for Biopython. In addition to incorporating suggestions received on the Biopython mailing list, it has been redesigned for parallel computation using Disco. Disco is an implementation of the distributed MapReduce framework in Erlang and Python. The code is available from the git repository; this post describes the impetus and design behind the MapReduce revision.

The scale of biological analyses is growing quickly thanks to new sequencing technologies. Bioinformatics programmers will need to learn techniques to rapidly analyze extremely large data sets. My coding toolbox has expanded to tackle these problems in two ways. The first is exploring programming languages designed for speed and parallelism, like Haskell. Additionally, I have been learning general techniques for parallelizing programs. Both require re-thinking code design to take advantage of increasingly available multi-processor and clustered architectures.

The MapReduce framework, originally proposed by Google, exemplifies the idea of redesigning code to analyze large data sets in parallel. In short, the programmer writes two functions: map and reduce. The map function handles the raw parsing work; for instance, it parses a line of GFF text and structures the details of interest. The reduce function combines the results from the map parsing, making them available for additional processing outside of the parallel part of the job. Several implementations of MapReduce have become popularly used. Apache’s Hadoop is a mature Java implementation with an underlying distributed file system. Here we utilize Disco, an implementation in Erlang and Python from Nokia Research Center.

The MapReduce GFF parser consists of two standalone functions. The map function takes a line of GFF text and first determines if we should parse it based on a set of limits. This allows the user to only pull items of interest from the GFF file, saving memory and time:

def _gff_line_map(line, params):
    strand_map = {'+' : 1, '-' : -1, '?' : None, None: None}
    line = line.strip()
    if line[0] != "#":
        parts = line.split('\t')
        should_do = True
        if params.limit_info:
            for limit_name, limit_values in params.limit_info.items():
                cur_id = tuple([parts[i] for i in 
                if cur_id not in limit_values:
                    should_do = False

If the GFF line is to be parsed, we use it to build a dictionary with all the details. Additionally, the line is classified as a top level annotation, a standard flat feature with a location, or part of a parent/child nested feature. The results are returned as a dictionary. For the disco parallel implementation, we use JSON to convert the dictionary into a flattened string:

        if should_do:
            assert len(parts) == 9, line
            gff_parts = [(None if p == '.' else p) for p in parts]
            gff_info = dict()
            # collect all of the base qualifiers for this item
            quals = collections.defaultdict(list)
            if gff_parts[1]:
            if gff_parts[5]:
            if gff_parts[7]:
            for key, val in [a.split('=') for a in gff_parts[8].split(';')]:
            gff_info['quals'] = dict(quals)
            gff_info['rec_id'] = gff_parts[0]
            # if we are describing a location, then we are a feature
            if gff_parts[3] and gff_parts[4]:
                gff_info['location'] = [int(gff_parts[3]) - 1,
                gff_info['type'] = gff_parts[2]
                gff_info['id'] = quals.get('ID', [''])[0]
                gff_info['strand'] = strand_map[gff_parts[6]]
                # Handle flat features
                if not gff_info['id']:
                    final_key = 'feature'
                # features that have parents need to link so we can pick up
                # the relationship
                elif gff_info['quals'].has_key('Parent'):
                    final_key = 'child'
                # top level features
                    final_key = 'parent'
            # otherwise, associate these annotations with the full record
                final_key = 'annotation'
            return [(final_key, (simplejson.dumps(gff_info) if params.jsonify
                else gff_info))]

The advantage of this distinct map function is that it can be run in parallel for any line in the file. To condense the results back into a synchronous world, the reduce function takes the results of the map function and combines them into a dictionary of results:

def _gff_line_reduce(map_results, out, params):
    final_items = dict()
    for gff_type, final_val in map_results:
        send_val = (simplejson.loads(final_val) if params.jsonify else 
        except KeyError:
            final_items[gff_type] = [send_val]
    for key, vals in final_items.items():
        out.add(key, (simplejson.dumps(vals) if params.jsonify else vals))

Finally, the dictionaries of GFF information are converted into Biopython SeqFeatures and attached to SeqRecord objects; the standard object interface is identical to that used for GenBank feature files.

Re-writing the code sped it up by a roughly calculated 10% for single processor work. Splitting up parsing and object creation allowed me to apply some simple speed ups which contributed to this improvement. The hidden advantage of learning new programming frameworks is that it encourages you to think about familiar problems in different ways.

This implementation is designed to work both in parallel using Disco, and locally on a standard machine. Practically, this means that Disco is not required unless you care about parallelizing the code. Parallel coding may also not be the right approach for a particular problem. For small files, it is more efficient to run the code locally and avoid the overhead involved with making it parallel.

When you suddenly need to apply your small GFF parsing script to many gigabytes of result data the code will scale accordingly by incorporating Disco. To look at the practical numbers related to this scaling, I plan to follow up on this post with tests using Disco on Amazon’s Elastic Compute Cloud.

Written by Brad Chapman

March 22, 2009 at 10:59 am

BioSQL on Google App Engine

with 4 comments

The BioSQL project provides a well thought out relational database schema for storing biological sequences and annotations. For those developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage via reusability. Some of the best features of BioSQL from my experience are:

  • Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
  • Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
  • Overall data model based on GenBank flat files. This makes teaching the model to biology oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.

Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with Web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions of the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.

Towards these goals, I have been working on a BioSQL backed interface for Google App Engine. Google App Engine is a Python based framework to quickly develop and deploy web applications. For data storage, Google’s Datastore provides an object interface to a distributed scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by avoiding an initial barrier to making their data available.

My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.

I am happy to announce a simple demonstration server running a BioSQL based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.

Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.

Written by Brad Chapman

March 15, 2009 at 8:45 am