Blue Collar Bioinformatics


Exploring BioPerl GenBank to GFF mapping


A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

  • Parsed relatively easily.
  • Human readable and editable in Excel.
  • Quickly understood at a basic level.
  • Flexible and adaptable. GFF3, the current format, handles a number of incompatibility issues that arose in GFF2.
  • Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine the BioPerl to BioSQL mapping is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this: -out stdout > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I will explore more below. GFF files are structured as tab delimited columns. The first 8 columns describe the exact sequence location and contain a Sequence Ontology term describing the relationship between the annotation and the sequence region. The final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975    GenBank gene    1       1847    .       +       .       ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;

This maps directly to the corresponding feature in the original GenBank table:

     gene            1..1847
                     /note="chromobox homolog 8"
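To make the column layout concrete, here is a minimal Python sketch that splits a GFF3 feature line into its eight fixed columns and the final key-value column. The function name `parse_gff_line` and the dictionary layout are illustrative assumptions, not part of BioPerl or any other library, and the sketch skips directives and comment lines.

```python
from urllib.parse import unquote

# Names of the eight fixed GFF3 columns; the ninth column holds the attributes.
GFF_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff_line(line):
    """Split one GFF3 feature line into fixed columns plus an attribute dict.

    Hypothetical helper for illustration only.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(parts))
    record = dict(zip(GFF_COLUMNS, parts[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in parts[8].split(";"):
        if not pair:
            continue  # tolerate the trailing ';' seen in the example output
        key, _, value = pair.partition("=")
        # A key may carry several comma-separated values; decode after splitting.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# The gene line from the output file, rebuilt with real tab separators.
line = "\t".join([
    "NM_001078975", "GenBank", "gene", "1", "1847", ".", "+", ".",
    "ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;",
])
feature = parse_gff_line(line)
```

With this, `feature["attributes"]["Note"]` holds the text of the GenBank `/note` qualifier, which is the correspondence the two snippets illustrate.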

This is a nice one-to-one mapping of the GenBank feature table. The mapping from feature keys to Sequence Ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to capitalized standard keys where possible (Note, Dbxref) and to all-lowercase names where they do not correspond to a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.
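The back and forth dictionary could be as small as the following Python sketch. The two entries are inferred only from the keys visible in the example line (`Note`, `Dbxref`) and the usual GenBank qualifier spellings (`/note`, `/db_xref`); a real mapping would need more entries, and the function names are hypothetical.

```python
# Assumed mapping from GenBank qualifier names to capitalized GFF3 keys.
# Only the two keys visible in the example line are covered here.
GENBANK_TO_GFF = {
    "note": "Note",
    "db_xref": "Dbxref",
}
# Reverse lookup for going from GFF keys back to qualifier names.
GFF_TO_GENBANK = {gff_key: qualifier for qualifier, gff_key in GENBANK_TO_GFF.items()}


def qualifier_to_gff_key(name):
    """Return the standard GFF3 key if one is known, else a lowercase name."""
    return GENBANK_TO_GFF.get(name, name.lower())


def gff_key_to_qualifier(key):
    """Return the GenBank qualifier name for a GFF3 attribute key."""
    return GFF_TO_GENBANK.get(key, key)
```

Qualifiers without a standard counterpart, such as `gene`, pass through unchanged in both directions, so the round trip preserves them.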

The less straightforward part of the mapping involves the high level annotations that describe the entire sequence. These correspond to the header section in the GenBank file and map to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank | BioSQL table | Current BioPerl GFF | Proposed GFF key/value
LOCUS; Molecule type | | |
LOCUS; division | | |
LOCUS; date | term date_changed | |
DEFINITION | | Note, but combined with COMMENT | description
 | accession and version | |
 | term keywords | |
 | | organism and Dbxref to taxon ID | Full lineage needs representation as well
 | | Dbxref for PubMed IDs; need to store full reference information as well |
COMMENT | | comment1 and Note, combined with DEFINITION | comment1 only

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.
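As a rough illustration of that guess, the Python sketch below writes one reference as its own GFF3 line. Everything specific here is an assumption made for the example: the `reference` feature type, the `authors` and `title` keys, and the helper names are hypothetical rather than an agreed convention, and the citation values are placeholders. The escaping follows the GFF3 rule that `;`, `=`, `&`, `,`, `%` and control characters must be percent-encoded in the last column.

```python
def escape_value(value):
    """Percent-encode the characters GFF3 reserves in attribute values."""
    escaped = []
    for char in value:
        if char in ";=&,%" or ord(char) < 0x20 or ord(char) == 0x7F:
            escaped.append("%%%02X" % ord(char))
        else:
            escaped.append(char)
    return "".join(escaped)


def reference_to_gff(seqid, start, end, reference):
    """Build one GFF3 line for a literature reference (hypothetical layout)."""
    attributes = ";".join(
        "%s=%s" % (key, escape_value(value)) for key, value in reference.items()
    )
    # "reference" as the feature type is an assumption, not a settled term.
    columns = [seqid, "GenBank", "reference", str(start), str(end),
               ".", ".", ".", attributes]
    return "\t".join(columns)


# Placeholder citation data, not a real publication.
example_reference = {
    "authors": "Example A, Example B",
    "title": "Placeholder title; not a real paper",
    "Dbxref": "PUBMED:0000000",
}
reference_line = reference_to_gff("NM_001078975", 1, 1847, example_reference)
```

Keeping each reference on its own line means the author list and title stay together, at the cost of needing an agreed feature type and key names.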

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please help contribute there and ask questions on the BioSQL mailing list.

Written by Brad Chapman

February 22, 2009 at 3:56 pm

Posted in OpenBio
