Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.


Talking at BOSC 2009 about publishing biological data on the web


The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years and giving a short and sweet 10-minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with the Data and Analysis Management special interest group, which is also full of interesting talks.

The talk will promote development of reusable web-based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:

  • Motivation: rapidly organize and display biological data in a web accessible format.
  • Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
  • Current state: back end databases for storing biological data — BioSQL, GMOD
  • Current state: full featured web applications targeted at users — Galaxy, GBrowse
  • My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with larger community.
  • Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
  • Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
  • Framework: make use of front end JavaScript frameworks such as jQuery and ExtJS.
  • Framework: make use of back end web frameworks — Pylons
  • Implementation: Demo server for displaying sequences plus annotations
  • Implementation: Utilizes BioSQL schema, ported to object oriented data store; Google App Engine backend or MongoDB backend
  • Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
  • Implementation: Additional screenshots from internal web displays.
  • Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
  • Challenges: Re-usable components for cross-language functionality; JavaScript front end displays for multi-language back ends.
  • Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.
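To make the "GenBank in and GFF out" item in the outline concrete, here is a minimal sketch of the export half using only the Python standard library. It is not the demo server's code: the `Feature` class is a hypothetical stand-in for a feature parsed from a GenBank file (for example by Biopython), and the `genbank` source label, the function names and the sample values are all assumptions made for illustration. It shows the two details that usually matter, shifting 0-based half-open coordinates to GFF's 1-based inclusive ones and escaping reserved characters in the attributes column.

```python
from dataclasses import dataclass, field
from urllib.parse import quote


@dataclass
class Feature:
    """Hypothetical stand-in for a parsed GenBank feature.

    Coordinates are 0-based and half-open, as in Python slicing.
    """
    type: str
    start: int
    end: int
    strand: int  # 1, -1, or 0 for unstranded
    qualifiers: dict = field(default_factory=dict)


def _escape(text):
    # Percent-encode characters that are reserved in GFF3 attributes
    # (such as ';', '=', '&' and ','); spaces are left as-is.
    return quote(str(text), safe=" :.-_/")


def to_gff3_line(seqid, feature, source="genbank"):
    """Format one feature as a tab-separated, nine-column GFF3 line.

    Simplification: score and phase are always written as '.', although
    GFF3 expects a phase for CDS features.
    """
    strand = {1: "+", -1: "-"}.get(feature.strand, ".")
    pairs = []
    for key, values in sorted(feature.qualifiers.items()):
        joined = ",".join(_escape(value) for value in values)
        pairs.append(_escape(key) + "=" + joined)
    columns = [
        seqid,
        source,
        feature.type,
        str(feature.start + 1),  # 0-based start becomes 1-based
        str(feature.end),        # half-open end equals inclusive end
        ".",
        strand,
        ".",
        ";".join(pairs) or ".",
    ]
    return "\t".join(columns)


# Made-up example values, purely for illustration.
example = Feature("gene", 9, 30, -1,
                  {"gene": ["abc1"], "note": ["first; second"]})
print(to_gff3_line("seq1", example))
```

In a real pipeline the `Feature` objects would come from a GenBank parser rather than being built by hand, and an established GFF writer would be a safer choice than hand-rolled formatting.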

I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.

Written by Brad Chapman

June 11, 2009 at 7:41 am

BioSQL on Google App Engine


The BioSQL project provides a well thought out relational database schema for storing biological sequences and annotations. For those developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage via reusability. Some of the best features of BioSQL from my experience are:

  • Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
  • Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
  • Overall data model based on GenBank flat files. This makes teaching the model to biology oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.
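The key/value pair model mentioned above can be illustrated with a small, self-contained sketch. The two tables below are simplified stand-ins named after BioSQL's `term` and `seqfeature_qualifier_value` tables; the real schema has more columns and constraints, and the helper functions `add_qualifier` and `get_qualifiers` are hypothetical names introduced only for this example. The idea is that each annotation key becomes a row in `term`, and every value is stored as a (feature, term, rank, value) row, so new kinds of annotation need no schema change.

```python
import sqlite3

# Simplified stand-ins for two BioSQL tables; not the real DDL.
SCHEMA = """
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id INTEGER NOT NULL,
    term_id INTEGER NOT NULL REFERENCES term(term_id),
    rank INTEGER NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (seqfeature_id, term_id, rank)
);
"""


def add_qualifier(conn, seqfeature_id, key, value):
    """Store one key/value pair, creating the key's term on first use."""
    conn.execute("INSERT OR IGNORE INTO term (name) VALUES (?)", (key,))
    (term_id,) = conn.execute(
        "SELECT term_id FROM term WHERE name = ?", (key,)
    ).fetchone()
    # rank keeps repeated values for the same key in insertion order.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM seqfeature_qualifier_value "
        "WHERE seqfeature_id = ? AND term_id = ?",
        (seqfeature_id, term_id),
    ).fetchone()
    conn.execute(
        "INSERT INTO seqfeature_qualifier_value VALUES (?, ?, ?, ?)",
        (seqfeature_id, term_id, count + 1, value),
    )


def get_qualifiers(conn, seqfeature_id):
    """Return a feature's qualifiers as a dict of key -> list of values."""
    rows = conn.execute(
        "SELECT t.name, q.value "
        "FROM seqfeature_qualifier_value q "
        "JOIN term t ON t.term_id = q.term_id "
        "WHERE q.seqfeature_id = ? "
        "ORDER BY t.name, q.rank",
        (seqfeature_id,),
    )
    qualifiers = {}
    for name, value in rows:
        qualifiers.setdefault(name, []).append(value)
    return qualifiers


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up qualifier values, purely for illustration.
add_qualifier(conn, 1, "gene", "abc1")
add_qualifier(conn, 1, "db_xref", "DemoDB:1")
add_qualifier(conn, 1, "db_xref", "DemoDB:2")
print(get_qualifiers(conn, 1))
```

The dictionary returned by `get_qualifiers` has the same shape as the qualifiers attached to a feature in a GenBank file, which is what makes the mapping easy to explain to biology oriented users.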

Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions on the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.

Towards these goals, I have been working on a BioSQL-backed interface for Google App Engine. Google App Engine is a Python-based framework for quickly developing and deploying web applications. For data storage, Google’s Datastore provides an object interface to a distributed, scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by avoiding an initial barrier to making their data available.

My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.

I am happy to announce a simple demonstration server running a BioSQL-based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank-formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.
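As a rough picture of what an object interface that mimics a relational model can look like, here is a minimal sketch in plain Python. It does not use the App Engine Datastore API and is not taken from the server's source: `BioEntry`, `SeqFeature` and `EntryStore` are hypothetical names, with dataclasses standing in for datastore model classes and a dictionary standing in for the storage backend. The point is only the shape of the mapping: rows that a relational schema would link with foreign keys become objects that hold their features and qualifiers directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SeqFeature:
    """One annotated region, loosely analogous to a seqfeature row."""
    type: str
    start: int
    end: int
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class BioEntry:
    """One sequence record, loosely analogous to a bioentry row."""
    accession: str
    description: str
    sequence: str
    features: List[SeqFeature] = field(default_factory=list)


class EntryStore:
    """In-memory stand-in for an object datastore, keyed by accession."""

    def __init__(self) -> None:
        self._entries: Dict[str, BioEntry] = {}

    def put(self, entry: BioEntry) -> None:
        self._entries[entry.accession] = entry

    def get(self, accession: str) -> Optional[BioEntry]:
        return self._entries.get(accession)

    def features_of_type(self, accession: str, feature_type: str) -> List[SeqFeature]:
        """Return an entry's features of one type, or [] if it is unknown."""
        entry = self.get(accession)
        if entry is None:
            return []
        return [f for f in entry.features if f.type == feature_type]


# Made-up record, purely for illustration.
store = EntryStore()
store.put(BioEntry(
    accession="DEMO0001",
    description="demonstration record",
    sequence="ATGCATGC",
    features=[
        SeqFeature("gene", 0, 8, {"gene": ["abc1"]}),
        SeqFeature("CDS", 0, 6),
    ],
))
print([f.qualifiers for f in store.features_of_type("DEMO0001", "gene")])
```

A real datastore version would also have to deal with persistence, indexing and query limits, all of which this sketch leaves out.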

Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.

Written by Brad Chapman

March 15, 2009 at 8:45 am