Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Posts Tagged ‘biosql

BioSQL on Google App Engine

with 4 comments

The BioSQL project provides a well thought out relational database schema for storing biological sequences and annotations. For those developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage via reusability. Some of the best features of BioSQL from my experience are:

  • Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
  • Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
  • Overall data model based on GenBank flat files. This makes teaching the model to biology oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.

Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with Web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions of the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.

Towards these goals, I have been working on a BioSQL backed interface for Google App Engine. Google App Engine is a Python based framework to quickly develop and deploy web applications. For data storage, Google’s Datastore provides an object interface to a distributed scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by avoiding an initial barrier to making their data available.

My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.

I am happy to announce a simple demonstration server running a BioSQL based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.

Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.

Written by Brad Chapman

March 15, 2009 at 8:45 am

Initial GFF parser for Biopython

with 10 comments

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.

Implementation

GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.

Testing

The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.

BioSQL

To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.

Summary

This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with , , , ,

Authorization schema for BioSQL databases

leave a comment »

The BioSQL relational database model provides a set of tables for storing biological sequences and information related to them. The model covers many common use cases when representing sequence information, and serves as a reusable component that can be plugged into many projects.

For larger web-based projects, users often would like a permissions system that:

  • Allows storage of data available only to the user, for collecting experimental data that is in progress.
  • Shares data with other colleagues within defined groups.
  • Defines the ability to edit data.
  • Stores user specific preferences and defaults.

Here I present a simple relational database model for MySQL that ties in with BioSQL and provides this functionality. The full schema can be downloaded from github and is described in more detail below.

The authorization framework is based on a three tiered system used in the Turbogears web framework:

  • Users — these are the base logged in users. The auth_user table defines them and includes user_names, passwords, and e-mail information.
  • Groups — each user can belong to one or more groups. These groups will often represent real life groups, like research laboratories or collaborative groups.
  • Permissions — each group has one or more permissions associated with it. This assigns a group one or more things it can do or access.

In BioSQL, we can associate permissions with many items using generic key value pairs. These are implemented as qualifier values, and many different items have this association mechanism including biological entries (bioentry), features, and database references (dbxref). The BioSQL schema description goes into more detail about the representation, which is very flexible.

As an example of using this framework, suppose you add several new sequence you want to be available to only users in your research group. You would do the following:

  1. Create a new permission (auth_permission table).
  2. Associate the permission with your group (auth_group_permission table).
  3. Assign the new permission id to a qualifer value named permissions on the bioentries of interest (bioentry_qualifer_value table, in BioSQL).

The web interface should then be designed to check for permissions and display only those available to the current users. This can be done very flexibly according to your design preferences; in the most transparent case displays would only include those items for which permissions are available. Users without permissions could proceed unaware of these items they cannot access.

The final desired item of functionality is storing user specific information. This helps a user set defaults for a web interface, save queries or other complex information, and generally achieve a more pleasant browsing experience. This information is stored using the same key-value pair mechanism described above, in the auth_user_qualifier_value table. This should be very familiar to BioSQL developers, and flexibly allows us to store any kind of information of interest.

Hopefully this authentication system is useful to others implementing interfaces on top of BioSQL databases. In subsequent posts, I will provide client and server code which helps manage user authorization within this system.

Written by Brad Chapman

March 1, 2009 at 12:48 pm

Posted in OpenBio

Tagged with , , ,

Exploring BioPerl GenBank to GFF mapping

leave a comment »

A mailing list message from Peter about importing GFF files to BioSQL inspired me to take a look at how BioPerl treats GFF files. Generic Feature Format (GFF) is a plain text file format used to represent annotations and features on biological sequences. It is a nice biological file format:

  • Parsed relatively easily.
  • Human readable and editable in Excel.
  • Quickly understood at a basic level.
  • Flexible and adapting. GFF3, the current format, handles a number of incompatibility issues that arose in GFF2.
  • Widely used.

BioSQL is a relational database model that stores annotations and features on sequences. As Peter implies, having a general mapping between the two would facilitate plain text database dumps from BioSQL databases in GFF. Conversely, GFF formatted files could be loaded directly into BioSQL databases.

The BioSQL object model maps very closely to the GenBank file format, so a good way to examine the BioPerl to BioSQL mapping is to produce GFF from a GenBank file. The BioPerl distribution contains a script to do exactly this:


bp_genbank2gff3.pl -out stdout cbx8.gb > cbx8.gff

Starting with this straightforward GenBank file, the above command produces a GFF file that I will explore more below. GFF files are structured as tab delimited columns. The first 8 columns describe the exact sequence location and contain a Sequence Ontology term describing the relationship between the annotation and the sequence region. The final column is a set of key-value pairs with the annotation data. For example, here is a line from our output file:

NM_001078975    GenBank gene    1       1847    .       +       .       
ID=cbx8;Dbxref=GeneID:779897;Note=chromobox homolog 8;gene=cbx8;
gene_synonym=MGC147589

This maps directly to the corresponding feature in the original GenBank table:

     gene            1..1847
                     /gene="cbx8"
                     /gene_synonym="MGC147589"
                     /note="chromobox homolog 8"
                     /db_xref="GeneID:779897"

This is a nice one-to-one mapping of the GenBank feature table. The ontology for mapping feature keys to the sequence ontology terms was discussed in more detail in an earlier post on BioSQL ontologies. Here, the qualifier names map to uppercase standard keys where possible (Note, DBxref) and all lowercase names where they do not characterize a standard term. For BioSQL, these GFF lines would map directly into the seqfeature table, with a dictionary to provide the back and forth mapping between standard terms and qualifier names.

The less straightforward part of the mapping involves the high level annotations which describe the entire sequence. This corresponds to the header section in the GenBank file and maps to several specialized tables in the BioSQL schema. Here is a summary of the current mappings in BioPerl GFF:

GenBank BioSQL table Current BioPerl GFF Proposed GFF key/value
LOCUS; identifier ACCESSION
bioentry

name

ID  
LOCUS; Molecule type
biosequence

alphabet

mol_type  
LOCUS; division
bioentry

division

  division
LOCUS; date
bioentry_qualifer_value

term date_changed

date  
DEFINITION
bioentry

description

Note, but combined with COMMENT description
VERSION
bioentry

accession and version

  hasVersion
GI
bioentry

identifier

  identifier
KEYWORDS
bioentry_qualifer_value

term keywords

  subject
SOURCE and ORGANISM
taxon
organism and Dbxref to taxon ID Full lineage needs representation as well
REFERENCE
reference
  Dbxref for PubMed IDs; need to store full reference information as well
COMMENT
comment
comment1 and Note, combined with DEFINITION comment1 only

Most of the major mappings are in place, with some naming refinement needed. The most complicated outstanding aspect would be storing the reference journal information. Someone more familiar with GFF may be able to offer a solution that has been used previously. My guess at this point is that each reference would be a separate GFF line item, with key/value pairs for the authors, title and other information.

Overall, GFF offers a nice flat file output format for BioSQL databases. Much of the mapping from GFF to BioSQL is in place currently in BioPerl, with consensus needed for the missing parts. With that established, the other languages that support BioSQL can follow the BioPerl mapping. In my view, being able to round-trip between GFF flat files and the BioSQL relational database would help drive usage of both.

Edit: James Procter put together a BioSQL wiki page to help specify the mapping. Please help contribute there and ask questions on the BioSQL mailing list.

Written by Brad Chapman

February 22, 2009 at 3:56 pm

Posted in OpenBio

Tagged with , , ,