Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Initial GFF parser for Biopython

with 10 comments

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.

Implementation

GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.

Testing

The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.

BioSQL

To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.

Summary

This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with , , , ,

10 Responses

Subscribe to comments with RSS.

  1. hi,
    thank you for doing this.. I have used gff a lot during my master’s project, but now I am using different formats.

    I just wanted to say that if you want, you can use the unofficial biopython migration branch on github to keep this code synchronized with the biopython repository.

    You’ll have to fork the biopython branch here:
    http://github.com/biopython/biopython/network
    and push your changes to your forked branch.

    dalloliogm

    March 11, 2009 at 3:59 am

    • Giovanni;
      No problem. Hopefully we will get the module in good enough shape that it’ll be ready the next time you need to parse GFF.

      Thanks for the suggested patches for the tests on github. I’ve got a couple of questions for you about your style; it differs a bit from what I’ve done in the past on testing. Why do you prefer to have the defined stuff be class attributes instead of just sticking them in the setUp function? Is it because of performance concerns?

      Also, I noticed your doctest note. I always felt the unittest approach is more natural for development; I write the tests as I write the code, formalizing areas where I imagine errors sneaking in during future development. Why do you prefer doctest?

      I’d like to try and sync this with Biopython git when it is finalized; it will hopefully help me learn git more thoroughly. I’m still getting over my fears of branching and merging from many years of CVS/SVN work.

      Thanks again for the feedback,
      Brad

      Brad Chapman

      March 11, 2009 at 4:42 pm

  2. Hi Brad,

    Have the GFF parsers come on much since this post? I’m hoping to use BioPython and PyCogent for comparative genomics analyses and would like to write some wrappers to pull data in from multiple genome databases into GFF3, so I can work with the data locally in a common format!

    Here’s a post I made on BioStar http://biostar.stackexchange.com/questions/3566/how-can-i-unify-genome-data-from-different-sources

    Cheers,

    Steve

    gawbul

    November 13, 2010 at 3:59 am

    • Steve;
      Glad you are interested in using it. The latest documentation for use is on the Biopython website:

      http://biopython.org/wiki/GFF_Parsing

      The module works great for GFF data I’ve been using in my regular research. If you have any questions or find GFF that is giving it issues, definitely let me know.

      Thanks,
      Brad

      Brad Chapman

      November 16, 2010 at 9:34 am

      • Thanks for the reply Brad!

        Does the module also support inline FASTA? I notice from some parsers, particularly for GFF2 that you have to load the FASTA and GFF files separately.

        I’d like to be able to parse the GFF file and then have access to the sequence for the particular annotation objects.

        Cheers,

        Steve

        Steve Moss

        November 16, 2010 at 10:00 am

        • Steve;
          Yes it supports the ‘##FASTA’ directive so you can include FASTA in the GFF file. However, better practice is to keep the FASTA and GFF separate, since you’ll likely need FASTA files in programs where you don’t want to parse it out of a GFF file first.

          Brad Chapman

          November 17, 2010 at 7:10 am

  3. Hi Brad. Thanks for your work.

    I trying your GFF parser software to write gff files from genbank files and subsequently use AnnotatationSketch from genometools.org to draw some schematics.

    They have an online demo here: http://genometools.org/cgi-bin/annotationsketch_demo.cgi
    Asuming that tool is parsing GFFs properly you could use that to test if your parser is working ok.

    I tried here and it seems to fail with gt.core.error.GTError: GenomeTools error: could not parse number ‘.’ on line 3 in file ‘/home/merc/gitcode/mirna-django/src/scripts/tp53.gff’

    The gff3 parsed from the biopearl tool bp_genbank2gff3 works so I am asuming your parser put a dot in the wrong place.

    Let me know if I can help.

    macabro22

    March 9, 2012 at 11:14 pm


Leave a comment