Python GFF parser update — parallel parsing and GFF2

Parallel parsing

Last week we discussed refactoring the Python GFF parser to use a MapReduce framework. This was designed with the idea of being able to scale GFF parsing as file size increases. In addition to large files describing genome annotations, GFF is spreading to next-generation sequencing; SOLiD provides a tool to convert their mapping files to GFF.

Parallel processing introduces overhead due to software intermediates and networking costs. For the Disco implementation of GFF parsing, parsed lines run through Erlang and are translated to and from JSON strings. Invoking this overhead is worthwhile only if enough processors are utilized to overcome the slowdown. To estimate when we should start to parallelize, I looked at parsing a 1.5GB GFF file on a small multi-core machine and a remote cluster. Based on rough testing and non-scientific linear extrapolation of the results, I estimate 8 processors are needed to start to see a speed-up over local processing.

The starting baseline for parsing our 1.5GB file is one and a half minutes using a single processor on my commodity Dell desktop. This desktop has 4 cores; running Disco with all 4 CPUs increases the time to 3 minutes. Once Disco itself has been set up, switching between the two is seamless since the file is parsed in shared memory.
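One way to read the 8 processor estimate is as a break-even calculation from the two timings above. The sketch below assumes ideal inverse scaling, where Disco run time halves each time the processor count doubles; that assumption and the function name are illustrative, not measured behavior.

```python
def break_even_processors(serial_seconds, parallel_seconds, parallel_processors):
    """Estimate how many processors make the parallel run match the serial one.

    Assumes run time scales as 1/n with the number of processors, which is
    an idealization; real clusters add networking and scheduling costs.
    """
    # Total processor-seconds consumed by the measured parallel run.
    total_work = parallel_seconds * parallel_processors
    # Processors needed to spread that work over the serial wall-clock time.
    return total_work / serial_seconds


# 90 seconds on one core versus 180 seconds with Disco on 4 cores.
print(break_even_processors(90, 180, 4))
```

Under this assumption 8 processors only match the single-core time, so a real speed-up needs more than that.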

The advantage of utilizing Disco is that it can scale from this local implementation to very large clusters. Amazon’s Elastic Compute Cloud (EC2) is an amazing resource where you can quickly set up and run jobs on powerful hardware. It is essentially an instant on-demand cluster for running applications. Using the ElasticFox Firefox plugin and the setup directions for Disco on EC2, I was able to quickly test GFF parsing on a test cluster of three small instances (AMI ami-cfbc58a6, a Debian 5.0 Lenny image). For distributed jobs, the main challenges are setting up each of the cluster nodes with the software and distributing the files across the nodes. Disco provides scripts to install itself across the cluster and to distribute the file being parsed. When you are attacking a GFF parsing job that is prohibitively slow or memory intensive on your local hardware, a small cluster of a few extra-large or extra-large high-CPU instances on EC2 will help you overcome these limitations. Hopefully in the future Disco will become available on some standard Amazon machine images, lowering the barrier to getting a job running.

In practical terms, local GFF parsing will be fine for most standard files. When you are limited by parsing time with large files, attack the problem using either a local cluster or EC2 with 8 or more processors. To better utilize a small number of local CPUs, it makes sense to explore a lightweight solution such as the new Python multiprocessing module.
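As a minimal sketch of the multiprocessing idea, the map step (parsing one line) can be farmed out to a pool of local worker processes while the reduce step (combining results) stays in the parent. The function names and the simplified record dictionary are hypothetical and are not part of the GFF parser itself.

```python
import multiprocessing
from collections import Counter


def parse_line(line):
    """Map step: turn one tab-delimited GFF line into a small dict.

    Returns None for blank lines, comments, and lines without 9 columns.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != 9:
        return None
    return {"seqid": parts[0], "type": parts[2],
            "start": int(parts[3]), "end": int(parts[4])}


def count_types(records):
    """Reduce step: tally feature types, skipping unparsed lines."""
    return Counter(rec["type"] for rec in records if rec is not None)


def count_types_parallel(lines, processes=4, chunksize=1000):
    """Parse lines in worker processes and combine the results locally."""
    with multiprocessing.Pool(processes) as pool:
        # imap hands lines to the workers in batches of `chunksize`,
        # which keeps inter-process overhead low for many small items.
        return count_types(pool.imap(parse_line, lines, chunksize))
```

Passing an open file handle as `lines` streams the file through the pool. On platforms that start workers with spawn, the call belongs under an `if __name__ == "__main__":` guard.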

GFF2 support

The initial target for GFF parsing was the GFF3 standard. However, many genome centers still use the older GFF2 or GTF formats. The main parsing difference between these formats is in the attributes. In GFF3, they look like:

  ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2

while in GFF2 they are less standardized, and look like:

  Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2"

The parser has been updated to handle GFF2 attributes correctly, with test cases from several genome centers. In practice, there are several tricky implementations of the GFF2 specification; if you find attributes that the current parser handles incorrectly, please pass them along.
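To illustrate the difference between the two attribute styles, here is a simplified sketch of how each could be split into key/value pairs. It is not the parser's actual implementation: the function names are hypothetical, and the GFF2 version assumes that quoted values never contain semicolons, which real files sometimes violate.

```python
from urllib.parse import unquote


def parse_gff3_attributes(text):
    """Split 'key=value;key=value' pairs; values may be comma separated."""
    attrs = {}
    for field in text.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 escapes reserved characters with URL-style percent encoding.
        attrs.setdefault(key, []).extend(unquote(v) for v in value.split(","))
    return attrs


def parse_gff2_attributes(text):
    """Split 'key "value" ; key "value"' pairs (naive about embedded ';')."""
    attrs = {}
    for field in text.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(" ")
        # Drop surrounding whitespace and the optional double quotes.
        attrs.setdefault(key, []).append(value.strip().strip('"'))
    return attrs


print(parse_gff3_attributes("ID=CDS:B0019.1;Parent=Transcript:B0019.1;locus=amx-2"))
print(parse_gff2_attributes('Transcript "B0019.1" ; WormPep "WP:CE40797" ; Note "amx-2"'))
```

Both functions store values as lists so that repeated keys and multi-valued attributes end up in the same shape.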

GFF2 and GFF3 also differ in how nested features are handled. A standard example of nesting is specifying the coding regions of a transcript. Since GFF2 didn’t provide a default way to do this, there are several different methods used in practice. Currently, the parser leaves these GFF2 features as flat and you would need to write custom code on top of the parser to nest them if desired.
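The custom nesting code could take many forms, since it depends on the convention a given genome center uses. As one hedged sketch, assuming flat features are represented as plain dictionaries with an `attributes` mapping (a hypothetical structure, not the parser's own record type), features can be grouped under a shared attribute such as `Transcript`:

```python
from collections import defaultdict


def nest_by_attribute(features, key="Transcript"):
    """Group flat features under the value of a shared attribute.

    Returns (nested, orphans): nested maps each attribute value to its
    features sorted by start position; orphans lack the attribute.
    """
    nested = defaultdict(list)
    orphans = []
    for feature in features:
        values = feature["attributes"].get(key)
        if values:
            for value in values:
                nested[value].append(feature)
        else:
            orphans.append(feature)
    for children in nested.values():
        children.sort(key=lambda f: f["start"])
    return dict(nested), orphans


flat = [
    {"type": "CDS", "start": 200, "end": 300,
     "attributes": {"Transcript": ["B0019.1"]}},
    {"type": "CDS", "start": 50, "end": 100,
     "attributes": {"Transcript": ["B0019.1"]}},
]
nested, orphans = nest_by_attribute(flat)
```

Files that link parents and children through a different attribute would only need a different `key`.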

The latest version of the GFF parsing code is available from GitHub. To install it, click the download link on that page and you will get the whole directory along with a setup.py file to install it. It installs outside of Biopython since it is still under development. As always, I am happy to accept any contributions or suggestions.

Written by Brad Chapman

March 29, 2009 at 10:49 am

Posted in OpenBio

Tagged with bioinformatics, biopython, cloud-computing, ec2, gff

9 Responses

The GFF parser seems to be pretty good
Thanks
I have a doubt whether we can parse information for gene, mRNA, CDS, exon and UTR region in a single stretch from a GFF file…

Anonymous

April 19, 2009 at 1:57 am

- Thanks for the message on the GFF parser. If you have an example file that is not working as expected, please do send along details about the file. You can do so here, or on the Biopython development list (http://biopython.org/wiki/Mailing_lists).
  
  Brad Chapman
  
  April 19, 2009 at 1:54 pm
  
- To whom it may concern,
  
  Thanks for the development of a quick parser for GFF files. It is very useful.
  
  I have a doubt,
  I used the GFFParser.py program to extract the genome annotation from the file attached with this mail. Please find the attached file. (Because of the size of file here I included a few lines)
  I wrote a python script like this
  
  ##################################################
  import GFFParser
  
  pgff = GFFParser.GFFMapReduceFeatureAdder(dict(), None)
  
  # Limit parsing to these feature types on Chr1
  cds_limit_info = dict(
      gff_type=["gene", "mRNA", "CDS", "exon"],
      gff_id=["Chr1"],
  )
  
  pgff.add_features('../PythonGFF/TAIR9_GFF_genes.gff3', cds_limit_info)
  
  pgff.base["Chr1"]
  
  final = pgff.base["Chr1"]
  
  ##################################################
  
  By executing this script I am able to extract gene, mRNA and exon annotation from the specified GFF file, but I am unable to extract the CDS-related information.
  It would be great if you could suggest a way to include gene, mRNA, exon and CDS information in a single stretch of parsing of the GFF file.
  
  Thanks in advance,
  
  Vipin T S
  vipin.ts@gmail.com
  
  Vipin T S
  
  August 14, 2009 at 8:17 am
  
  - Vipin;
    Thanks for the report. A detailed response is on the Biopython mailing list:
    
    http://lists.open-bio.org/pipermail/biopython/2009-August/005437.html
    
    Let me know if you have any other problems,
    Brad
    
    Brad Chapman
    
    August 14, 2009 at 3:54 pm
    
I would like to use the parser to convert my GFF2 to GFF3, but the application gave me this error:

student@410083:~/Desktop/chap/gff/Scripts/gff$ python gff2_to_gff3.py install
Traceback (most recent call last):
  File "gff2_to_gff3.py", line 12, in <module>
    from BCBio.GFF import GFFParser, GFF3Writer
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/__init__.py", line 3, in <module>
    # $Id: __init__.py 2134 2004-10-06 08:55:20Z fredrik $
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 33, in <module>
ImportError: cannot import name UnknownSeq

Do you have any suggestion to resolve it?

Thanx

Paolo

July 24, 2009 at 10:53 am

- Paolo;
  Thanks for giving this a try. It looks like you have an older version of Biopython without the UnknownSeq class. You can check your version with:
  
  >>> import Bio
  >>> Bio.__version__
  '1.51b+'
  
  You’ll want 1.50 or later to work with the GFF parser. You can upgrade with:
  
  easy_install -U biopython
  
  as the root user. Let me know if you have any other problems,
  Brad
  
  Brad Chapman
  
  July 24, 2009 at 4:32 pm
  
  - Thanks Brad!
    I don’t know why but with update option I obtained the 1.49 version of Biopython, but I manually downloaded the v1.51 and now the converter perfectly works!
    
    Paolo
    
    Paolo
    
    July 27, 2009 at 4:09 am
    
Interesting … I have been looking at something similar, but my stack is mongodb and celery with ghettoq.

The point here is to parse the GFF into json and then store in mongodb. With ghettoq and celery you drop the need for erlang.

James Casbon

May 7, 2010 at 9:41 am

- James;
  Sounds like a cool idea. I’m a big fan of MongoDB. Celery and RabbitMQ look cool, but I haven’t had an opportunity to play with them. I’d be interested in hearing what you come up with.
  
  Brad Chapman
  
  May 8, 2010 at 9:55 am
  