Posts Tagged ‘OpenBio’
On April 7th and 8th, a group of biologists and programmers gathered at the Broad Institute to work on improving interoperability of open-source bioinformatics tools. Organized by the Open Bioinformatics Foundation and GenomeSpace team, this was part of the lead-up to the Bioinformatics Open Source Conference (BOSC) in July in Berlin. The event is part of an ongoing series of coding sessions (Codefests or Hackathons) organized by the open bioinformatics community, which give programmers who typically work together remotely a chance to code and discuss in the same place for two days. These have been successful both in producing new code and in building connections which help sustain development of these community projects.
Goals and outcomes
One major challenge in analyzing biological data is interfacing multiple bioinformatics tools. Tools often work independently, and where general architectures like plugins or APIs exist, they are often project specific. The result is isolated islands of data exchange, where transferring data or resources between tools requires work that is often rate-limiting or insurmountable.
Our goal at the hackathon was to provide simple APIs and implementations that help facilitate transfers between multiple islands of functionality. GenomeSpace does this by providing a central hub and API to push and pull from tools. We wanted to generalize this to support multiple tools, and build client implementations that demonstrate this in practice. The long term goal is to encourage tool developers to provide server side APIs compatible with the more general library, making extension of the connector toolkit easier. For developers, the client API would allow them to easily transfer files between multiple tools without needing to learn and implement the specific transfer APIs of each tool.
We called this high level client library Genome Connector (gcon, for short) and took a practical approach by implementing client libraries that provide a common interface to multiple tools: GenomeSpace, Galaxy, BaseSpace, 23andMe and general key-value stores through jClouds. To identify a reasonable amount of work for two days, we focused on file transfer: authentication, finding files, getting and putting files to remote analysis platforms. In addition we defined some critical components for doing biological work:
- File metadata: We need to be able to store arbitrary key/value pairs on objects to capture the essential biological information needed to interpret them, like organism and genome build. In addition, metadata enables provenance and tracking of files by allowing them to be annotated with their history and processing steps.
- Filesets: Large biological files have secondary files with indexes, allowing indexed retrieval of data (for example: read bam and bai, variant vcf and idx, tabix gz and tbi). To avoid expensive reindexing, we want to group and transfer these together.
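To make that file-transfer surface concrete, here is a minimal sketch of what a common connector interface could look like. All class and method names are hypothetical illustrations, not the actual gcon API:

```python
from abc import ABC, abstractmethod

class GenomeConnector(ABC):
    """Hypothetical common client interface wrapping a remote tool.

    Concrete subclasses (for GenomeSpace, Galaxy, BaseSpace, ...) would
    translate these calls into each tool's native transfer API.
    """

    @abstractmethod
    def list_files(self, path="/"):
        """Return remote file names under a path."""

    @abstractmethod
    def get_file(self, remote_name, local_name):
        """Download a remote file (and, ideally, its fileset companions)."""

    @abstractmethod
    def put_file(self, local_name, remote_name, metadata=None):
        """Upload a file, attaching key/value metadata (organism, build...)."""

class InMemoryConnector(GenomeConnector):
    """Toy in-memory implementation, used only to illustrate the interface."""

    def __init__(self):
        self._store = {}  # remote_name -> (contents, metadata)

    def list_files(self, path="/"):
        return sorted(n for n in self._store if n.startswith(path))

    def get_file(self, remote_name, local_name):
        # Stores the local file name as a stand-in for real file contents.
        contents, _ = self._store[remote_name]
        return contents

    def put_file(self, local_name, remote_name, metadata=None):
        self._store[remote_name] = (local_name, metadata or {})

conn = InMemoryConnector()
conn.put_file("reads.bam", "/data/reads.bam",
              metadata={"organism": "H. sapiens", "genome_build": "GRCh37"})
print(conn.list_files("/data"))  # ['/data/reads.bam']
```

A tool-specific connector would swap the in-memory dictionary for calls to that tool's transfer API while keeping the same method signatures.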
We also identified other useful extensions that would help improve interoperability and facilitate building connected tools, like publish/subscribe hooks to avoid having to poll servers for updates, and smarter approaches to sending data that avoid duplication and unnecessary transfers.
The output of our discussion and coding is a set of common Genome Connector implementations in multiple languages. GitHub repositories are available for in-progress Java, Python and Clojure implementations. These wrap multiple diverse tools and expose them through a common top level API, allowing developers to push and pull data from multiple tools.
I’m immensely grateful to the incredible participants who generously donated their time and expertise to help with these projects. For anyone interested we also have detailed documentation on discussions during the hackathon.
Bioinformatics Open Source Conference
If you’re a bioinformatics programmer interested in open source coding and helping answer biological questions by improving usability and connectivity of tools, you’re welcome to join the OpenBio and BOSC communities. We’ve created a biological interoperability mailing list for additional discussion. The next BOSC conference is July 19th and 20th in Berlin, Germany as part of the ISMB conference. There will also be another two day Codefest preceding BOSC on July 17th and 18th. Abstracts for talks at BOSC are due this Friday, April 12th. Looking forward to seeing everyone at future BOSC and coding events.
I’ve recently moved positions to the bioinformatics core at Harvard School of Public Health. It’s a great place to do science, with plenty of researchers doing interesting work and actively looking for bioinformatics collaborators. The team, working alongside members of the Hide Lab, is passionate about open source work. Both qualities made it a great fit for my interests and experience.
My new group is currently hiring bioinformatics researchers. The work involves interacting collaboratively with a research group to understand their biological problem, creatively attacking the mountains of data underlying the research question, and presenting the results back in an intuitive fashion. On the programming side, it’s an opportunity to combine existing published toolkits with your own custom algorithms and approaches. On the biology side, you should be passionate and interested in thinking of novel ways to advance our understanding of the problems. Practically, all of this work will involve a wide range of technologies and approaches; I expect plenty of next-generation sequencing data and lots of learning about the best ways to scale analyses.
Our other goal is to build re-usable tools for the larger research community. We work extensively with analysis frameworks like Galaxy and open standards like ISA-Tab. We hope to extract the common parts from disparate experiments to build abstractions that help get new analyses done quicker. Tool building also involves automating and deploying analysis pipelines in a way that allows biologists to run them directly. By democratizing analyses and presenting results to researchers at a high level they can directly interact with, science is accelerated and the world becomes an awesomer place.
So if you enjoy the work I write about here, and have always secretly wanted to sit in an office right next to me, now is your big chance (no stalkers, please). If this sounds of interest, please get in touch and I’d be happy to pass along more details.
Amazon web services provide scalable, on demand computational resources through their elastic compute cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I’m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) containing packages integrated from several existing efforts. The hope is to consolidate the community’s open source work around a single, continuously improving, machine image.
This image incorporates software from several existing AMIs:
- JCVI Cloud BioLinux — JCVI’s work porting Bio-Linux to the cloud.
- bioperl-max — Fortinbras’ package of BioPerl and associated informatics tools.
- MachetEC2 — An InfoChimps image loaded with data mining software.
Each of these libraries inspired different aspects of developing this image and associated infrastructure, and I’m extremely grateful to the authors for their code, documentation and discussions.
The current AMI is available for loading on EC2 — search for ‘CloudBioLinux’ in the AWS console or go to the CloudBioLinux project page for the latest AMIs. Automated scripts and configuration files with contained packages are available as a GitHub repository.
This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up to date libraries and programs they have installed on their work machines. If their favorite package is missing it should be quick and easy to add, making the improvement available to future developers.
Achieving these goals requires help and contributions from other programmers utilizing the cloud — everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I’d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:
- Programmers: help expand the configuration files for your areas of interest:
  - Perl CPAN support and libraries
  - Ruby gems
  - Java libraries
  - Haskell hackage support and libraries
  - Erlang libraries
- Bioinformatics areas of specialization:
  - Next-gen sequencing
  - Structural biology
  - Parallelized algorithms
  - Much more… Let us know what you are interested in.
- Documentation experts: provide cookbook style instructions to help others get started.
- Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.
- Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We’d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
- Testers: Check that this runs on open source Eucalyptus clouds, additional linux distributions, and other cloud deployments.
If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.
In addition to supplying an image for downstream use, this implementation was designed to be easily extendible. Inspired by the MachetEC2 project, packages to be installed are entered into a set of easy to edit configuration files in YAML syntax. There are three different configuration file types:
- main.yaml — The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.
- packages.yaml — Defines Debian/Ubuntu packages to be installed. This leans heavily on the work of the Debian Med and Bio-Linux communities, as well as all of the hard working package maintainers for the distributions. If it exists in package form, you can list it here.
- python-libs.yaml, r-libs.yaml — These take advantage of language specific ways of installing libraries. Currently implemented is support for Python library installation from the Python package index, and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.
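As a sketch of how the split between main.yaml and packages.yaml drives a build — the group and package names below are illustrative, and plain Python structures stand in for what a YAML library would parse from the files:

```python
# Parsed contents of the YAML files, shown as Python structures for
# illustration (the real build parses them with a YAML library).
main_config = {"packages": ["bio_general", "python"]}  # groups to install
packages_config = {  # packages.yaml: apt packages, organized by group
    "bio_general": ["blast2", "emboss"],
    "bio_ngs": ["bwa", "samtools"],
    "python": ["python-numpy"],
}

def packages_to_install(main, packages):
    """Flatten the selected groups into one package install list.

    Commenting a group out of main.yaml simply drops its packages here.
    """
    to_install = []
    for group in main["packages"]:
        to_install.extend(packages.get(group, []))
    return to_install

print(packages_to_install(main_config, packages_config))
# ['blast2', 'emboss', 'python-numpy']
```

The resulting flat list is what the automation would hand to the package manager, which is why adding a program is usually a one-line edit to a configuration file.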
We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding in new functionality.
If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it’ll be two enjoyable days of cloud informatics development. I’m looking forward to hearing from other developers who are interested in building and maintaining an easy to use, up to date, machine image that can help make biological computation more accessible to the community.
I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meet up of open source coders organized by Toshiaki Katayama and the great folks at the Database Center of Life Sciences (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.
The technical focus was on RDF (Resource Description Framework), a standard way to represent data as triples; each triple is made up of a subject defining the item, a predicate specifying a relationship, and an object representing the linked data. By providing data in RDF along with common naming schemes for shared objects, we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.
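A toy example of the triple model, using made-up compact identifiers in place of full URIs — pattern matching with wildcards is the same basic operation a SPARQL engine performs at scale:

```python
# Each triple is (subject, predicate, object); shared identifiers are
# what let independent data sources link to the same items.
triples = [
    ("uniprot:P38398", "go:associated_with", "GO:0006281"),
    ("uniprot:P38398", "rdfs:label", "BRCA1_HUMAN"),
    ("GO:0006281", "rdfs:label", "DNA repair"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Everything we know about this protein, following links across the graph:
for s, p, o in query(triples, subject="uniprot:P38398"):
    print(p, o)
```

Because the GO term appears as both an object and a subject, a second query against it recovers its human-readable label, which is the "linked data" idea in miniature.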
RDF is managed in data stores like Virtuoso, which are equivalent to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy to use Python library to query RDF stores through SPARQL.
The Python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface simplifies the process of making a query to two steps:
- Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
- Pass the query builder to a retrieval server which executes the query and returns the results in a tabular format, as a numpy RecordArray.
The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.
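The two-step pattern might be sketched like this; the class and its filter handling are invented for illustration, and a real builder would also emit the graph patterns that bind each variable to the RDF data:

```python
class QueryBuilder:
    """Collects attributes and filters, then emits a SPARQL query string.

    Illustrative sketch only: variable names map directly to attribute
    names here, where the real library translates them to the endpoint's
    schema and adds the necessary triple patterns.
    """

    def __init__(self):
        self._attributes = []
        self._filters = []

    def add_attributes(self, names):
        self._attributes.extend(names)

    def add_filter(self, name, value):
        self._filters.append((name, value))

    def to_sparql(self):
        select = " ".join("?%s" % a for a in self._attributes)
        where = " . ".join(
            'FILTER regex(?%s, "%s", "i")' % (name, value)
            for name, value in self._filters)
        return "SELECT %s WHERE { %s }" % (select, where)

builder = QueryBuilder()
builder.add_attributes(["protein_name", "gene_name"])
builder.add_filter("disease_description", "diabetes")
print(builder.to_sparql())
```

A retrieval server object then only needs to submit the generated string to its SPARQL endpoint and reshape the response into a table.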
Two interfaces were implemented at BioHackathon 2010:
- BioGateway — a SPARQL endpoint to an RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
- InterMine — an XML based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.
By providing a common interface for both semantic data and more traditional data sources, we hope to facilitate the conversion by data providers to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing resources on developing their biological queries.
BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.
This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:
```python
from systemsbio import Biogateway, UniProtGOQueryBuilder

builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
builder.add_attributes(["protein_name", "interactor", "gene_name"])

server = Biogateway()
results = server.search(builder)
print len(results), results.dtype.names
result = results[0]
print result['protein_name'], result['gene_name'], \
      result['interactor'], result['GO_term']
```
An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:
```python
from systemsbio import Biogateway, ReferenceBuilder

builder = ReferenceBuilder()
builder.add_filter("protein_id", "1433B_HUMAN")
builder.add_attributes(["reference"])

server = Biogateway()
results = server.search(builder)
print len(results), results.dtype.names
result = results[0]
print result['protein_id'], result['reference']
```
The retrieved PubMed identifiers can then be passed to Biopython’s Entrez module to look up the full citation details:

```python
from Bio import Entrez

Entrez.email = "email@example.com"
pubmed_id = result['reference'].replace("PMID_", "")
handle = Entrez.esummary(db="pubmed", id=pubmed_id)
record = Entrez.read(handle)[0]
print record['Title']
print record['PubDate']
print ",".join(record['AuthorList'])
print record['FullJournalName'], record['Volume'], record['Pages']
# Novel raf kinase protein-protein interactions found by an exhaustive
# yeast two-hybrid analysis.
# 2003 Feb
# Yuryev A,Wennogle LP
# Genomics 81 112-25
```
The full source code is available from GitHub: systemsbio.py. The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLwrapper to interact with the remote server and parse the results.
InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:
- Templates — Pre-defined queries that capture common ways biologists search the database.
- Query builder — A graphical interface to define custom queries, allowing manual discovery of attributes of interest.
We access InterMine programmatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4Me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:
```python
from intermine import Intermine, SubmissionQueryBuilder

builder = SubmissionQueryBuilder()
builder.add_attributes(["submission_id", "submission_title",
                        "developmental_stage"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.add_filter("antibody_name", "H3K4me3")

server = Intermine("http://intermine.modencode.org")
table = server.search(builder)
print table.dtype.names
print table
# ('submission_id', 'submission_title', 'developmental_stage')
# [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
#  ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
#   'Early Stage Embryos')
#  ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
#   'Early Stage Embryos')]
```
An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:
```python
from intermine import Intermine, ExperimentQueryBuilder

builder = ExperimentQueryBuilder()
builder.add_attributes(["submission_id", "experiment_name"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.free_text_filter("ChIP-seq")

server = Intermine("http://intermine.modencode.org")
table = server.search(builder)
print table.dtype.names
print table
# ('submission_id', 'experiment_name')
# [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
#  ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
#  ...
```
It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day to day summaries:
You are invited to fork and extend the code on GitHub.
The BioSQL project provides a well thought out relational database schema for storing biological sequences and annotations. For those developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage via reusability. Some of the best features of BioSQL from my experience are:
- Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
- Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
- Overall data model based on GenBank flat files. This makes teaching the model to biology oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.
Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions on the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.
Towards these goals, I have been working on a BioSQL backed interface for Google App Engine. Google App Engine is a Python based framework to quickly develop and deploy web applications. For data storage, Google’s Datastore provides an object interface to a distributed scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by avoiding an initial barrier to making their data available.
My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.
I am happy to announce a simple demonstration server running a BioSQL based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.
Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.
The BioSQL relational database model provides a set of tables for storing biological sequences and information related to them. The model covers many common use cases when representing sequence information, and serves as a reusable component that can be plugged into many projects.
For larger web-based projects, users often would like a permissions system that:
- Allows storage of data available only to the user, for collecting experimental data that is in progress.
- Shares data with other colleagues within defined groups.
- Defines the ability to edit data.
- Stores user specific preferences and defaults.
Here I present a simple relational database model for MySQL that ties in with BioSQL and provides this functionality. The full schema can be downloaded from github and is described in more detail below.
The authorization framework is based on a three tiered system used in the Turbogears web framework:
- Users — these are the base logged-in users. The auth_user table defines them and includes user names, passwords, and e-mail information.
- Groups — each user can belong to one or more groups. These groups will often represent real life groups, like research laboratories or collaborative groups.
- Permissions — each group has one or more permissions associated with it. This assigns a group one or more things it can do or access.
In BioSQL, we can associate permissions with many items using generic key value pairs. These are implemented as qualifier values, and many different items have this association mechanism including biological entries (bioentry), features, and database references (dbxref). The BioSQL schema description goes into more detail about the representation, which is very flexible.
As an example of using this framework, suppose you add several new sequences you want to be available only to users in your research group. You would do the following:
- Create a new permission.
- Associate the permission with your group.
- Assign the new permission id to a qualifier value named permissions on the bioentries of interest (bioentry_qualifier_value table, in BioSQL).
The web interface should then be designed to check for permissions and display only those items available to the current user. This can be done very flexibly according to your design preferences; in the most transparent case, displays would only include those items for which permissions are available. Users without permissions could proceed unaware of the items they cannot access.
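The three steps and the final visibility check can be sketched against a simplified, hypothetical version of the schema (sqlite standing in for MySQL, with abbreviated table definitions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE auth_group (group_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE auth_permission (permission_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE auth_group_permission (group_id INTEGER, permission_id INTEGER);
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, name TEXT);
    -- key/value pairs on entries, as in BioSQL's bioentry_qualifier_value
    CREATE TABLE bioentry_qualifier_value
        (bioentry_id INTEGER, term TEXT, value TEXT);
""")
# Step 1: create a permission; step 2: associate it with our group
db.execute("INSERT INTO auth_group VALUES (1, 'our_lab')")
db.execute("INSERT INTO auth_permission VALUES (10, 'lab_private')")
db.execute("INSERT INTO auth_group_permission VALUES (1, 10)")
# Step 3: tag the new sequence with the permission id as a qualifier value
db.execute("INSERT INTO bioentry VALUES (100, 'new_sequence')")
db.execute("INSERT INTO bioentry_qualifier_value VALUES (100, 'permissions', '10')")
# The interface then shows only entries whose permission ids are granted
# to one of the current user's groups:
visible = db.execute("""
    SELECT b.name FROM bioentry b
    JOIN bioentry_qualifier_value qv ON qv.bioentry_id = b.bioentry_id
    JOIN auth_group_permission gp
        ON gp.permission_id = CAST(qv.value AS INTEGER)
    WHERE qv.term = 'permissions' AND gp.group_id = 1
""").fetchall()
print(visible)  # [('new_sequence',)]
```

A user whose groups lack permission 10 would simply get an empty result from the same query, which is the "proceed unaware" behavior described above.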
The final desired item of functionality is storing user specific information. This helps a user set defaults for a web interface, save queries or other complex information, and generally achieve a more pleasant browsing experience. This information is stored using the same key-value pair mechanism described above, in the auth_user_qualifier_value table. This should be very familiar to BioSQL developers, and flexibly allows us to store any kind of information of interest.
Hopefully this authentication system is useful to others implementing interfaces on top of BioSQL databases. In subsequent posts, I will provide client and server code which helps manage user authorization within this system.
A recent thread by Peter on the BioSQL mailing list initiated some thinking about formalizing ontologies and terms in BioSQL. The current ad-hoc solution is that BioPerl, Biopython and BioJava attempt to use the same naming schemes. The worry is that this is not documented, no one is likely in a big hurry to document it, and we are essentially inventing another ontology.
The BioSQL methodology of storing key/value pair information on items can be mapped to RDF triples as:
| BioSQL item | RDF triple component |
| --- | --- |
| Bioentry or Feature | Subject |
| Ontology | Namespace of predicate |
| Term | Predicate term, relative to namespace |
Thus, a nice place to look for ontologies is in standards intended for RDF. Greg Tyrelle thought this same way a while ago and came up with an XSLT to transform GenBank XML to RDF, using primarily the Dublin Core vocabulary. On the biology side, the Sequence Ontology project provides an ontology meant for describing biological sequences. This includes a mapping to GenBank feature table names.
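As a concrete instance of the mapping table above, a small helper can combine an ontology namespace with a term to build the predicate of a triple; the function and the way the namespace is used here are illustrative, not part of BioSQL:

```python
def qualifier_to_triple(bioentry_id, ontology_ns, term, value):
    """Map a BioSQL key/value annotation onto an RDF-style triple.

    The bioentry supplies the subject, the ontology supplies the
    predicate's namespace, and the stored value becomes the object.
    """
    subject = "biosql:bioentry/%s" % bioentry_id       # subject
    predicate = "%s%s" % (ontology_ns, term)           # namespace + term
    return (subject, predicate, value)

# An alternative name stored on a bioentry, expressed with Dublin Core:
print(qualifier_to_triple(42, "http://purl.org/dc/terms/", "alternative",
                          "breast cancer 1, early onset"))
```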
Using these as a starting point, I generated a mapping of GenBank names to names in the Dublin Core and SO ontologies. This is meant as a basis for standardizing and documenting naming in BioSQL. The mapping file thus far covers almost all of the header and feature keys, and more than half of the qualifier keys:
- Tab delimited mapping file
- All of the python code that does the mapping and pulls information from associated files is available: github repository.
I would welcome suggestions for missing GenBank terms, as well as corrections on the terms mapped by hand.
Some notes on the mapping:
- Cross references to other identifiers are mapped with the Dublin Core term ‘relation’. These can occur in many places in the GenBank format. Using a single term allows them to be flattened, with mapping values in the form of ‘database:identifier.’ This is consistent with the GenBank /db_xref qualifier.
- Multiple names or descriptions of an item, also stored in multiple places in GenBank files, receive the Dublin Core term ‘alternative.’
- Organism and taxonomy ontologies are a whole project unto themselves, so I didn’t try to tackle them here.
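The flattening described in the first note can be sketched as a small helper (hypothetical code, mirroring the ‘database:identifier’ convention):

```python
def flatten_db_xrefs(xrefs):
    """Turn (database, identifier) cross references, wherever they occur
    in a GenBank record, into Dublin Core 'relation' key/value pairs."""
    return [("relation", "%s:%s" % (db, ident)) for db, ident in xrefs]

print(flatten_db_xrefs([("GeneID", "672"), ("taxon", "9606")]))
# [('relation', 'GeneID:672'), ('relation', 'taxon:9606')]
```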
Some other useful links for biological ontology mapping: