Initial GFF parser for Biopython

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.

Implementation

GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.

Testing

The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.

BioSQL

To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.

Summary

This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with bioinformatics, biopython, biosql, gff, python

10 Responses

Subscribe to comments with RSS.

hi,
thank you for doing this.. I have used gff a lot during my master’s project, but now I am using different formats.

I just wanted to say that if you want, you can use the unofficial biopython migration branch on github to keep this code synchronized with the biopython repository.

You’ll have to fork the biopython branch here:
– http://github.com/biopython/biopython/network
and push your changes to your forked branch.

dalloliogm

March 11, 2009 at 3:59 am

Reply
- Giovanni;
  No problem. Hopefully we will get the module in good enough shape that it’ll be ready the next time you need to parse GFF.
  
  Thanks for the suggested patches for the tests on github. I’ve got a couple of questions for you about your style; it differs a bit from what I’ve done in the past on testing. Why do you prefer to have the defined stuff be class attributes instead of just sticking them in the setUp function? Is it because of performance concerns?
  
  Also, I noticed your doctest note. I always felt the unittest approach is more natural for development; I write the tests as I write the code, formalizing areas where I imagine errors sneaking in during future development. Why do you prefer doctest?
  
  I’d like to try and sync this with Biopython git when it is finalized; it will hopefully help me learn git more thoroughly. I’m still getting over my fears of branching and merging from many years of CVS/SVN work.
  
  Thanks again for the feedback,
  Brad
  
  Brad Chapman
  
  March 11, 2009 at 4:42 pm
  
  Reply
Hi Brad,

Have the GFF parsers come on much since this post? I’m hoping to use BioPython and PyCogent for comparative genomics analyses and would like to write some wrappers to pull data in from multiple genome databases into GFF3, so I can work with the data locally in a common format!

Here’s a post I made on BioStar http://biostar.stackexchange.com/questions/3566/how-can-i-unify-genome-data-from-different-sources

Cheers,

Steve

gawbul

November 13, 2010 at 3:59 am

Reply
- Steve;
  Glad you are interested in using it. The latest documentation for use is on the Biopython website:
  
  http://biopython.org/wiki/GFF_Parsing
  
  The module works great for GFF data I’ve been using in my regular research. If you have any questions or find GFF that is giving it issues, definitely let me know.
  
  Thanks,
  Brad
  
  Brad Chapman
  
  November 16, 2010 at 9:34 am
  
  Reply
  - Thanks for the reply Brad!
    
    Does the module also support inline FASTA? I notice from some parsers, particularly for GFF2 that you have to load the FASTA and GFF files separately.
    
    I’d like to be able to parse the GFF file and then have access to the sequence for the particular annotation objects.
    
    Cheers,
    
    Steve
    
    Steve Moss
    
    November 16, 2010 at 10:00 am
    
    Reply
    - Steve;
      Yes it supports the ‘##FASTA’ directive so you can include FASTA in the GFF file. However, better practice is to keep the FASTA and GFF separate, since you’ll likely need FASTA files in programs where you don’t want to parse it out of a GFF file first.
      
      Brad Chapman
      
      November 17, 2010 at 7:10 am
      
      Reply
Hi Brad. Thanks for your work.

I trying your GFF parser software to write gff files from genbank files and subsequently use AnnotatationSketch from genometools.org to draw some schematics.

They have an online demo here: http://genometools.org/cgi-bin/annotationsketch_demo.cgi
Asuming that tool is parsing GFFs properly you could use that to test if your parser is working ok.

I tried here and it seems to fail with gt.core.error.GTError: GenomeTools error: could not parse number ‘.’ on line 3 in file ‘/home/merc/gitcode/mirna-django/src/scripts/tp53.gff’

The gff3 parsed from the biopearl tool bp_genbank2gff3 works so I am asuming your parser put a dot in the wrong place.

Let me know if I can help.

macabro22

March 9, 2012 at 11:14 pm

Reply
- Hugo,
  Thanks for the report. Would you be able to send along some of ‘tp53.gff’? From the error message it sounds like I’ll need at least the first 3 lines. If you could post this on GitHub I can take a look.
  
  Brad Chapman
  
  March 10, 2012 at 8:01 am
  
  Reply
  - Sure Brad. Here is the reply from Gnometools’ author Sasha: http://genometools.lighthouseapp.com/projects/17855/tickets/61-error-parsing-gff3-file
    
    The tp53 file on github: https://github.com/mercutio22/biopython/blob/master/Tests/sample_files/tp53.gff
    
    macabro22
    
    March 10, 2012 at 3:22 pm
    
    Reply
    - Hugo,
      Thanks again. I checked in fixes so the GenBank to GFF conversion should be cleaner and be handled by genometools:
      
      https://github.com/chapmanb/bcbb/commit/540b030c40e17a6a76b173aadefa8beb56a660f0
      
      let me know if you run into other issues.
      
      Brad Chapman
      
      March 10, 2012 at 5:13 pm
      
      Reply

Blue Collar Bioinformatics