Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘gff

Examining and adjusting your GFF file

with 2 comments

Generic Feature Format (GFF) defines a standard template for representing biological features. Within this template, however, is room for flexibility. Different GFF producers may decide to format their data in slightly different ways. While parsing these files is not a problem, correctly interpreting and utilizing the data can be. The in-development Biopython GFF parser provides utilities to get a high level summary of the elements of a GFF file, and to adjust line items in the file during parsing.

Examining a GFF file

When downloading a new GFF file, the first step is getting an overview of the file contents. The GFF parser provides a GFFExaminer class to help with this. The first function of interest is available_limits:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)

It returns a dictionary defining various ways to limit your parsing of the GFF file, along with the count of each item. As an example, here is a trimmed dump from one of the test files. This has features on two different record ids — chromosomes ‘I’ and ‘X’, with 159 and 6 items, respectively. Also listed are the combination of the 2nd source column and 3rd type column, and the type column by itself:

{'gff_id': {('I',): 159,
            ('X',): 6},
 'gff_source_type': {('Coding_transcript', 'CDS'): 27,
                     ('Coding_transcript', 'exon'): 33,
                     ('Coding_transcript', 'five_prime_UTR'): 4,
                     ('Coding_transcript', 'gene'): 2,
                     ('Coding_transcript', 'intron'): 29,
                     ('Coding_transcript', 'mRNA'): 4,
                     ('Coding_transcript', 'three_prime_UTR'): 3,
                     ('mass_spec_genome', 'translated_nucleotide_match'): 7},
 'gff_type': {('CDS',): 57,
              ('exon',): 33,
              ('five_prime_UTR',): 4,
              ('gene',): 2,
              ('intron',): 29,
              ('mRNA',): 4,
              ('three_prime_UTR',): 3,
              ('translated_nucleotide_match',): 7}}

In addition to the overview of file contents, nested relationships are another important component of the file to understand. A summary of these is available through the parent_child_map function:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)

Again, a dictionary is returned. The keys in the dictionary are parent source and type elements in the file, while the values are children of those elements. For instance, here is the dictionary for a three tiered relationship where genes have mRNAs, and each mRNA can have coding regions, exons, introns, and 5′ and 3′ untranslated regions:

{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}

Adjusting GFF lines during parsing

Occasionally you may run into a file format that you can comprehend, but that does not match your expectations for where items should be. SOLiD GFF files are one example; many thanks are due to David Schruth, who has been patiently working with me on the parsing of these files. They have the read name in the first column, which is normally used for the record ID the feature maps to. The actual record ID is contained as an attribute, i=1, where the 1 corresponds to the index of record it maps to in the original FASTA alignment file:

3_336_815_F3    solid   read    55409   55428   10.4    +       .       i=1

The GFFAddingIterator has an optional attribute, line_adjust_fn, which can be used to solve this problem. The function is called each time a line is read and passed a parsed dictionary representing the line. The dictionary for the above line is:

{'id': '',
 'is_gff2': False,
 'location': [55408, 55428],
 'quals': {
           'i': ['1'],
           'score': ['10.4'],
           'source': ['solid']},
 'rec_id': '3_336_815_F3',
 'strand': 1,
 'type': 'read'}

The function takes this item as an argument and returns the dictionary, but with any adjustments that need to be made. In our example, we look up the name of the record corresponding to the i=1 index. This name is swapped in for the rec_id, while that information moves to an attribute named read_name:

from Bio import SeqIO
from BCBio.GFF.GFFParser import GFFAddingIterator

class SolidFastaRemap:
    def __init__(self, initial_fasta):
        self._in_map = self._get_index_map(initial_fasta)

    def _get_index_map(self, in_file):
        in_map = dict()
        in_handle = open(in_file)
        for index, rec in enumerate(SeqIO.parse(in_handle, "fasta")):
            in_map[index] =
        return in_map

    def adjust_fn(self, results):
        # 1-based indexes; convert to 0-based
        rec_index = int(results['quals']['i'][0]) - 1
        read_name = results['rec_id']
        results['quals']['read_name'] = [read_name]
        results['rec_id'] = self._in_map[rec_index]
        return results

remapper = SolidFastaRemap(fasta_file)
gff_iterator = GFFAddingIterator(line_adjust_fn=remapper.adjust_fn)
rec_dict = gff_iterator.get_all_features(gff_file)

This allows you to fix the file during the parsing, saving multiple passes through the file. This general functionality can be used to cleanly deal with any other inconsistencies that crop up during your GFF parsing adventures.

Written by Brad Chapman

April 12, 2009 at 8:55 pm

Posted in OpenBio

Tagged with ,

More python GFF parsing — iterative parsing and GFF2 nested features

leave a comment »

Work on the python generic feature format (GFF) parser continues to push forward; many thanks to those who have provided feedback in helping to refine the functionality. Previously, we discussed the initial implementation, introduced MapReduce parsing for parallelization, and discussed deploying on a cluster and GFF2 parsing. This week describes an interface for iterator based parsing of GFF files and nested features for GFF2 files.

Iterative parsing of GFF

GFF files are line-based and related features can be located anywhere in the file. To guarantee all features are parsed and combined correctly, the entire file needs to be scanned and loaded. For large files, we require strategies to load the data without abusing all available memory.

In some cases, it is known that parsing the file in chunks will not result in any information being lost. GFF files produced by SOLiD for short read alignments are one common case. These are read based, non-nested, and quite large. To tackle these files, the parser now has an iterator based interface that can be used to iterate over sections of the file:

from BCBio.GFF.GFFParser import GFFAddingIterator

gff_iterator = GFFAddingIterator()
for rec_dict in gff_iterator.get_features(gff_file, target_lines=3000000):
    for rec in rec_dict.values():
        # deal with rec.features    

This parses a file into ~350MB sized pieces, returning a dictionary of Biopython SeqRecord objects keyed by their names. Each SeqRecord contains all of the features added from that chunk of the file. These can be persisted to a database or otherwise analyzed before proceeding on to the next chunk, keeping memory requirements more reasonable.

The file can still be filtered by feature type, allowing extraction of only features of interest. This example uses the Biopython SeqIO interface to parse an initial FASTA file with our sequences, and then adds coding sequences on chromosome one to it in chunks of 1 million lines:

from BCBio.GFF.GFFParser import GFFAddingIterator
from Bio import SeqIO

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
for rec_dict in gff_iterator.get_features(gff_file,
        limit_info=cds_limit_info, target_lines=1000000):
    for rec in rec_dict.values():
        # deal with rec.features    

To avoid missing nested features, the iterator makes smart decisions about when to break the file. It is broken at points where all child features have their parents and can be expected to be nested correctly.

Nested features for GFF2

Nesting of features is handled nicely in the new GFF3 format. However, many sources provide information in the older GFF2 (also called GTF) format, which has a variety of nesting schemes. The test examples for the parser contain some examples of these from different online repositories:

  • Ensembl GFF2 — recognizes child features by a transcript_id attribute, and does not provide a parent feature
  • WormBase GFF2 — child features have a Transcript attribute for certain feature types; a parent feature is present, also with a Transcript attribute
  • JGI GFF2 — child features have a TranscriptId or ProteinId attribute and no parent feature

The updated parser handles all these styles of nesting, building a top level feature for those files where this parent is not present. This mimics the new GFF3 behavior to ease the transition to those files. Where parent features are missing, a new feature is created of type inferred_parent which spans the distance of all child features. These child features are available from the sub_features attribute of the parent.

These new updates improve parsing for older GFF2 files which are still widely used, and opens up parsing to new GFF files produced from SOLiD machines. The code is available from the standard github location. Please continue to pass along bug reports and suggestions.

Written by Brad Chapman

April 5, 2009 at 7:49 pm

Posted in OpenBio

Tagged with , ,

Python GFF parser update — parallel parsing and GFF2

with 9 comments

Parallel parsing

Last week we discussed refactoring the Python GFF parser to use a MapReduce framework. This was designed with the idea of being able to scale GFF parsing as file size increases. In addition to large files describing genome annotations, GFF is spreading to next-generation sequencing; SOLiD provides a tool to convert their mapping files to GFF.

Parallel processing introduces overhead due to software intermediates and networking costs. For the Disco implementation of GFF parsing, parsed lines run through Erlang and are translated to and from JSON strings. Invoking this overhead is worthwhile only if enough processors are utilized to overcome the slowdown. To estimate when we should start to parallelize, I looked at parsing a 1.5GB GFF file on a small multi-core machine and a remote cluster. Based on rough testing and non-scientific linear extrapolation of the results, I estimate 8 processors are needed to start to see a speed-up over local processing.

The starting baseline for parsing our 1.5GB file is one and half minutes using a single processor on my commodity Dell desktop. This desktop has 4 cores, and running Disco utilizing all 4 CPUs, the time increases to 3 minutes. Once Disco itself has been set up, switching between the two is seamless since the file is parsed in shared memory.

The advantage of utilizing Disco is that it can scale from this local implementation to very large clusters. Amazon’s Elastic Computing Cloud (EC2) is an amazing resource where you can quickly set up and run jobs on powerful hardware. It is essentially an instant on-demand cluster for running applications. Using the ElasticFox Firefox plugin and the setup directions for Disco on EC2, I was able to quickly test GFF parsing on a test cluster of three small (AMI ami-cfbc58a6, a Debian 5.0 Lenny instance) instances. For distributed jobs, the main challenges are setting up each of the cluster nodes with the software, and distributing the files across the nodes. Disco provides scripts to install itself across the cluster and to distribute the file being parsed. When you are attacking a GFF parsing job that is prohibitively slow or memory intensive on your local hardware, a small cluster of a few extra-large of extra-large high CPU instances on EC2 will help you overcome these limitations. Hopefully in the future Disco will become available on some standard Amazon machine images, lowering the threshold to getting a job running.

In practical terms, local GFF parsing will be fine for most standard files. When you are limited by parsing time with large files, attack the problem using either a local cluster or EC2 with 8 or more processors. To better utilize a small number of local CPUs, it makes sense to explore a light weight solution such as the new python multiprocessing module.

GFF2 support

The initial target for GFF parsing was the GFF3 standard. However, many genome centers still use the older GFF2 or GTF formats. The main parsing difference between these formats are the attributes. In GFF3, they look like:


while in GFF2 they are less standardized, and look like:

  Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2"

The parser has been updated to handle GFF2 attributes correctly, with test cases from several genome centers. In practice, there are several tricky implementations of the GFF2 specifications; if you find examples of incorrectly parsed attributes by the current parser, please pass them along.

GFF2 and GFF3 also differ in how nested features are handled. A standard example of nesting is specifying the coding regions of a transcript. Since GFF2 didn’t provide a default way to do this, there are several different methods used in practice. Currently, the parser leaves these GFF2 features as flat and you would need to write custom code on top of the parser to nest them if desired.

The latest version of the GFF parsing code is available from GitHub. To install it, click the download link on that page and you will get the whole directory along with a file to install it. It installs outside of Biopython since it is still under development. As always, I am happy to accept any contributions or suggestions.

Written by Brad Chapman

March 29, 2009 at 10:49 am

MapReduce implementation of GFF parsing for Biopython

with 4 comments

I previously wrote up details about starting a GFF parser for Biopython. In addition to incorporating suggestions received on the Biopython mailing list, it has been redesigned for parallel computation using Disco. Disco is an implementation of the distributed MapReduce framework in Erlang and Python. The code is available from the git repository; this post describes the impetus and design behind the MapReduce revision.

The scale of biological analyses is growing quickly thanks to new sequencing technologies. Bioinformatics programmers will need to learn techniques to rapidly analyze extremely large data sets. My coding toolbox has expanded to tackle these problems in two ways. The first is exploring programming languages designed for speed and parallelism, like Haskell. Additionally, I have been learning general techniques for parallelizing programs. Both require re-thinking code design to take advantage of increasingly available multi-processor and clustered architectures.

The MapReduce framework, originally proposed by Google, exemplifies the idea of redesigning code to analyze large data sets in parallel. In short, the programmer writes two functions: map and reduce. The map function handles the raw parsing work; for instance, it parses a line of GFF text and structures the details of interest. The reduce function combines the results from the map parsing, making them available for additional processing outside of the parallel part of the job. Several implementations of MapReduce have become popularly used. Apache’s Hadoop is a mature Java implementation with an underlying distributed file system. Here we utilize Disco, an implementation in Erlang and Python from Nokia Research Center.

The MapReduce GFF parser consists of two standalone functions. The map function takes a line of GFF text and first determines if we should parse it based on a set of limits. This allows the user to only pull items of interest from the GFF file, saving memory and time:

def _gff_line_map(line, params):
    strand_map = {'+' : 1, '-' : -1, '?' : None, None: None}
    line = line.strip()
    if line[0] != "#":
        parts = line.split('\t')
        should_do = True
        if params.limit_info:
            for limit_name, limit_values in params.limit_info.items():
                cur_id = tuple([parts[i] for i in 
                if cur_id not in limit_values:
                    should_do = False

If the GFF line is to be parsed, we use it to build a dictionary with all the details. Additionally, the line is classified as a top level annotation, a standard flat feature with a location, or part of a parent/child nested feature. The results are returned as a dictionary. For the disco parallel implementation, we use JSON to convert the dictionary into a flattened string:

        if should_do:
            assert len(parts) == 9, line
            gff_parts = [(None if p == '.' else p) for p in parts]
            gff_info = dict()
            # collect all of the base qualifiers for this item
            quals = collections.defaultdict(list)
            if gff_parts[1]:
            if gff_parts[5]:
            if gff_parts[7]:
            for key, val in [a.split('=') for a in gff_parts[8].split(';')]:
            gff_info['quals'] = dict(quals)
            gff_info['rec_id'] = gff_parts[0]
            # if we are describing a location, then we are a feature
            if gff_parts[3] and gff_parts[4]:
                gff_info['location'] = [int(gff_parts[3]) - 1,
                gff_info['type'] = gff_parts[2]
                gff_info['id'] = quals.get('ID', [''])[0]
                gff_info['strand'] = strand_map[gff_parts[6]]
                # Handle flat features
                if not gff_info['id']:
                    final_key = 'feature'
                # features that have parents need to link so we can pick up
                # the relationship
                elif gff_info['quals'].has_key('Parent'):
                    final_key = 'child'
                # top level features
                    final_key = 'parent'
            # otherwise, associate these annotations with the full record
                final_key = 'annotation'
            return [(final_key, (simplejson.dumps(gff_info) if params.jsonify
                else gff_info))]

The advantage of this distinct map function is that it can be run in parallel for any line in the file. To condense the results back into a synchronous world, the reduce function takes the results of the map function and combines them into a dictionary of results:

def _gff_line_reduce(map_results, out, params):
    final_items = dict()
    for gff_type, final_val in map_results:
        send_val = (simplejson.loads(final_val) if params.jsonify else 
        except KeyError:
            final_items[gff_type] = [send_val]
    for key, vals in final_items.items():
        out.add(key, (simplejson.dumps(vals) if params.jsonify else vals))

Finally, the dictionaries of GFF information are converted into Biopython SeqFeatures and attached to SeqRecord objects; the standard object interface is identical to that used for GenBank feature files.

Re-writing the code sped it up by a roughly calculated 10% for single processor work. Splitting up parsing and object creation allowed me to apply some simple speed ups which contributed to this improvement. The hidden advantage of learning new programming frameworks is that it encourages you to think about familiar problems in different ways.

This implementation is designed to work both in parallel using Disco, and locally on a standard machine. Practically, this means that Disco is not required unless you care about parallelizing the code. Parallel coding may also not be the right approach for a particular problem. For small files, it is more efficient to run the code locally and avoid the overhead involved with making it parallel.

When you suddenly need to apply your small GFF parsing script to many gigabytes of result data the code will scale accordingly by incorporating Disco. To look at the practical numbers related to this scaling, I plan to follow up on this post with tests using Disco on Amazon’s Elastic Compute Cloud.

Written by Brad Chapman

March 22, 2009 at 10:59 am

Initial GFF parser for Biopython

with 10 comments

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.


GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.


The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.


To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.


This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with , , , ,

Exploring BioPerl GenBank to GFF mapping

leave a comment »

A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

  • Parsed relatively easily.
  • Human readable and editable in Excel.
  • Quickly understood at a basic level.
  • Flexible and adapting. GFF3, the current format, handles a number of incompatibility issues that arose in GFF2.
  • Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine the BioPerl to BioSQL mapping is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this: -out stdout > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I will explore more below. GFF files are structured as tab delimited columns. The first 8 columns describe the exact sequence location and contain a Sequence Ontology term describing the relationship between the annotation and the sequence region. The final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975    GenBank gene    1       1847    .       +       .       
ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;

This maps directly to the corresponding feature in the original GenBank table:

     gene            1..1847
                     /note="chromobox homolog 8"

This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to the sequence ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to uppercase standard keys where possible (Note, DBxref) and all lowercase names where they do not characterize a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.

The less straightforward part of the mapping involves the high level annotations which describe the entire sequence. This corresponds to the header section in the GenBank file and maps to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank BioSQL table Current BioPerl GFF Proposed GFF key/value


LOCUS; Molecule type


LOCUS; division


LOCUS; date

term date_changed



Note, but combined with COMMENT description

accession and version




term keywords

organism and Dbxref to taxon ID Full lineage needs representation as well
  Dbxref for PubMed IDs; need to store full reference information as well
comment1 and Note, combined with DEFINITION comment1 only

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please help contribute there and ask questions on the BioSQL mailing list.

Written by Brad Chapman

February 22, 2009 at 3:56 pm

Posted in OpenBio

Tagged with , , ,