Exploring BioPerl GenBank to GFF mapping
A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:
- Parsed relatively easily.
- Human readable and editable in Excel.
- Quickly understood at a basic level.
- Flexible and adaptable. GFF3, the current format, handles a number of incompatibility issues that arose in GFF2.
- Widely used.
BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.
The BioSQL object model maps very closely to the GenBank file format, so a good way to examine how BioPerl's GFF would map to BioSQL is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this:
bp_genbank2gff3.pl -out stdout cbx8.gb > cbx8.gff
Starting with a straightforward GenBank file (cbx8.gb), the above command produces a GFF file that I explore in more detail below. GFF files are structured as nine tab-delimited columns. The first eight give the sequence identifier, source, feature type (a Sequence Ontology term), start, end, score, strand and phase. The final column is a set of key-value pairs holding the annotation data. For example, here is a line from our output file:
NM_001078975 GenBank gene 1 1847 . + . ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;gene_synonym=MGC147589
This maps directly to the corresponding feature in the original GenBank table:
     gene            1..1847
                     /gene="cbx8"
                     /gene_synonym="MGC147589"
                     /note="chromobox homolog 8"
                     /db_xref="GeneID:779897"
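To make the comparison concrete, here is a minimal sketch (in Python, standard library only) that splits the GFF line above into its eight fixed columns and a dictionary of attributes. It assumes the columns are tab-delimited, as they are in the real output file. `parse_gff_line` is an illustrative helper of my own, not part of BioPerl or any other library.

```python
from urllib.parse import unquote

# Names of the eight fixed GFF3 columns that precede the attribute column.
GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase")


def parse_gff_line(line):
    """Split one GFF3 feature line into fixed columns plus an attribute dict."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(parts))
    record = dict(zip(GFF_COLUMNS, parts[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in parts[8].split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # GFF3 values may be comma-separated lists and may be percent-encoded.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# The gene line from the output file, rebuilt with explicit tabs.
line = "\t".join([
    "NM_001078975", "GenBank", "gene", "1", "1847", ".", "+", ".",
    "ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;"
    "gene=cbx8;gene_synonym=MGC147589",
])
feature = parse_gff_line(line)
print(feature["type"], feature["start"], feature["end"])
print(feature["attributes"]["Dbxref"])
```

Each attribute value is kept as a list, since a GenBank feature can carry the same qualifier more than once.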
This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to the sequence ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to the capitalized standard GFF keys where possible (Note, Dbxref) and stay as all-lowercase names where they do not correspond to a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.
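The back and forth dictionary could look something like the following Python sketch. The two entries shown are only the ones visible in the example line. A real mapping would need the full set of standard GFF3 keys, and the function names are made up for illustration. I also assume the ID attribute is generated separately rather than stored as a qualifier.

```python
# Assumed subset of the mapping: GenBank qualifier -> standard GFF3 key.
QUALIFIER_TO_GFF = {"note": "Note", "db_xref": "Dbxref"}
# Reverse lookup for loading GFF attributes back as qualifiers.
GFF_TO_QUALIFIER = {gff: gb for gb, gff in QUALIFIER_TO_GFF.items()}
# Keys that exist only on the GFF side (assumption: generated, not stored).
GFF_ONLY_KEYS = {"ID"}


def qualifiers_to_attributes(qualifiers):
    """Rename GenBank qualifiers to GFF keys; unknown names stay lowercase."""
    return {QUALIFIER_TO_GFF.get(name, name.lower()): values
            for name, values in qualifiers.items()}


def attributes_to_qualifiers(attributes):
    """Rename GFF keys back to GenBank qualifiers, skipping GFF-only keys."""
    return {GFF_TO_QUALIFIER.get(key, key): values
            for key, values in attributes.items()
            if key not in GFF_ONLY_KEYS}


# Qualifiers of the cbx8 gene feature from the GenBank file.
qualifiers = {
    "gene": ["cbx8"],
    "gene_synonym": ["MGC147589"],
    "note": ["chromobox homolog 8"],
    "db_xref": ["GeneID:779897"],
}
attributes = qualifiers_to_attributes(qualifiers)
print(sorted(attributes))
print(attributes_to_qualifiers(attributes) == qualifiers)
```

Keeping both directions in plain dictionaries means the same table can serve a GFF dump from the database and a GFF load into it.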
The less straightforward part of the mapping involves the high level annotations which describe the entire sequence. This corresponds to the header section in the GenBank file and maps to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:
| GenBank | BioSQL table | Current BioPerl GFF | Proposed GFF key/value |
| LOCUS; identifier ACCESSION | | | |
| LOCUS; Molecule type | | | |
| | | Note, but combined with COMMENT | description |
| | | | accession and version |
| SOURCE and ORGANISM | | organism and Dbxref to taxon ID | Full lineage needs representation as well |
| | | | Dbxref for PubMed IDs; need to store full reference information as well |
| | | comment1 and Note, combined with DEFINITION | comment1 only |
Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.
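As a rough illustration of that guess, the Python sketch below writes one reference as its own GFF line. Everything specific here is an assumption on my part: the `reference` feature type is a placeholder rather than a confirmed Sequence Ontology term, the `authors`, `title` and `journal` keys are not an agreed standard, and the citation values are dummy text. The only firm part is the GFF3 rule that reserved characters in the attribute column get percent-encoded.

```python
# Characters GFF3 reserves in the attribute column, mapped to %XX escapes.
RESERVED = {c: "%%%02X" % ord(c) for c in "%;=&,\t\n\r"}


def encode_value(value):
    """Percent-encode the characters GFF3 reserves inside attribute values."""
    return "".join(RESERVED.get(ch, ch) for ch in value)


def reference_to_gff(seqid, start, end, reference, index):
    """Format one reference as a GFF line (hypothetical layout)."""
    attributes = [("ID", ["reference%d" % index])]
    for key in ("authors", "title", "journal"):  # hypothetical key names
        if key in reference:
            attributes.append((key, reference[key]))
    column9 = ";".join(
        "%s=%s" % (key, ",".join(encode_value(v) for v in values))
        for key, values in attributes
    )
    # "reference" as the feature type is a placeholder, not a verified SO term.
    return "\t".join([seqid, "GenBank", "reference", str(start), str(end),
                      ".", ".", ".", column9])


# Dummy citation data, purely to show the encoding.
ref = {
    "authors": ["Author A", "Author B"],
    "title": ["Placeholder title; not a real citation"],
    "journal": ["Placeholder Journal 1, 1-2"],
}
gff_line = reference_to_gff("NM_001078975", 1, 1847, ref, 1)
print(gff_line)
```

Multiple authors become a comma-separated list under a single key, which is why a comma inside a value (as in the journal string) has to be escaped.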
Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.