biosql

Posts Tagged ‘biosql’

BioSQL on Google App Engine

The BioSQL project provides a well thought out relational database schema for storing biological sequences and annotations. For those developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage via reusability. Some of the best features of BioSQL from my experience are:

Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
Overall data model based on GenBank flat files. This makes teaching the model to biology oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.

Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with Web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions of the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.

Towards these goals, I have been working on a BioSQL backed interface for Google App Engine. Google App Engine is a Python based framework to quickly develop and deploy web applications. For data storage, Google’s Datastore provides an object interface to a distributed scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by avoiding an initial barrier to making their data available.

My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.

I am happy to announce a simple demonstration server running a BioSQL based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.

Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.

Written by Brad Chapman

March 15, 2009 at 8:45 am

Posted in OpenBio

Tagged with bioinformatics, biopython, biosql, cloud-computing, gae, gsoc, OpenBio

Initial GFF parser for Biopython

with 10 comments

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.

Implementation

GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.

Testing

The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.

To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.

Summary

This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with bioinformatics, biopython, biosql, gff, python

Authorization schema for BioSQL databases

Exploring BioPerl GenBank to GFF mapping

leave a comment »

A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

Parsed relatively easily.
Human readable and editable in Excel.
Quickly understood at a basic level.
Flexible and adapting. GFF3, the current format, handles a number of incompatibility issues that arose in GFF2.
Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine the BioPerl to BioSQL mapping is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this:

bp_genbank2gff3.pl -out stdout cbx8.gb > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I will explore more below. GFF files are structured as tab delimited columns. The first 8 columns describe the exact sequence location and contain a Sequence Ontology term describing the relationship between the annotation and the sequence region. The final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975    GenBank gene    1       1847    .       +       .       
ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;
gene_synonym=MGC147589

This maps directly to the corresponding feature in the original GenBank table:

     gene            1..1847
                     /gene="cbx8"
                     /gene_synonym="MGC147589"
                     /note="chromobox homolog 8"
                     /db_xref="GeneID:779897"

This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to the sequence ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to uppercase standard keys where possible (Note, DBxref) and all lowercase names where they do not characterize a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.

The less straightforward part of the mapping involves the high level annotations which describe the entire sequence. This corresponds to the header section in the GenBank file and maps to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank	BioSQL table	Current BioPerl GFF	Proposed GFF key/value
LOCUS; identifier ACCESSION	bioentry name	ID
LOCUS; Molecule type	biosequence alphabet	mol_type
LOCUS; division	bioentry division		division
LOCUS; date	bioentry_qualifer_value term date_changed	date
DEFINITION	bioentry description	Note, but combined with COMMENT	description
VERSION	bioentry accession and version		hasVersion
GI	bioentry identifier		identifier
KEYWORDS	bioentry_qualifer_value term keywords		subject
SOURCE and ORGANISM	taxon	organism and Dbxref to taxon ID	Full lineage needs representation as well
REFERENCE	reference		Dbxref for PubMed IDs; need to store full reference information as well
COMMENT	comment	comment1 and Note, combined with DEFINITION	comment1 only

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please help contribute there and ask questions on the BioSQL mailing list.

Written by Brad Chapman

February 22, 2009 at 3:56 pm

Posted in OpenBio

Tagged with bioinformatics, bioperl, biosql, gff

Blue Collar Bioinformatics

Posts Tagged ‘biosql’

BioSQL on Google App Engine

Initial GFF parser for Biopython

Implementation

Testing

BioSQL

Summary

Authorization schema for BioSQL databases

Exploring BioPerl GenBank to GFF mapping

Recent Posts