Bioinformatics open source interoperability Hackathon at the Broad Institute
On April 7th and 8th, a group of biologists and programmers gathered at the Broad Institute to work on improving interoperability of open-source bioinformatics tools. Organized by the Open Bioinformatics Foundation and GenomeSpace team, this was part of the lead up to the Bioinformatics Open Source Conference (BOSC) in July in Berlin. The event is part of an ongoing series of coding sessions (Codefests or Hackathons) organized by the open bioinformatics community, which give programmers who typically work together remotely a chance to code and discuss in the same place for two days. These have been successful in both producing new code and in building connections which help sustain development of these community projects.
Goals and outcomes
One major challenge in analyzing biological data is interfacing multiple bioinformatics tools. Tools often work independently, and where general architectures like plugins or API exist they are often project specific. This results in isolated islands of data exchange, but transferring data or resources between tools requires work that is often rate-limiting or insurmountable.
Our goal at the hackathon was to provide simple APIs and implementations that help facilitate transfers between multiple islands of functionality. GenomeSpace does this by providing a central hub and API to push and pull from tools. We wanted to generalize this to support multiple tools, and build client implementations that demonstrate this in practice. The long term goal is to encourage tool developers to provide server side APIs compatible with the more general library, making extension of the connector toolkit easier. For developers, the client API would allow them to easily transfer files between multiple tools without needing to learn and implement the specific transfer APIs of each tool.
We called this high level client library Genome Connector (gcon, for short) and took a practical approach by implementing client libraries that provide a common interface to multiple tools: GenomeSpace, Galaxy, BaseSpace, 23andMe and general key-value stores through jClouds. To identify a reasonable amount of work for two days, we focused on file transfer: authentication, finding files, getting and putting files to remote analysis platforms. In addition we defined some critical components for doing biological work:
- File metadata: We need to be able to store arbitrary key/value on objects to assign essential biological information necessary to interpret it, like organisms and genome build. In addition, metadata allows provenance and tracking of files by enabling annotation of files with history and processing steps.
- Filesets: Large biological files have secondary files with indexes, allowing indexed retrieval of data (for example: read bam and bai, variant vcf and idx, tabix gz and tbi). To avoid expensive reindexing, we want to group and transfer these together.
We also identified other useful extensions that would help improve interoperability and facilitate building connected tools, like providing Publish/subscribe hooks to avoid having to poll servers for updates, and smarter approaches to sending data to avoid duplication and unnecessary transfer of data.
The output of our discussion and coding are common Genome Connector implementations in multiple languages. GitHub repositories are available for in-progress Java, Python and Clojure implementations. These wrap multiple diverse tools and expose them through a common top level API, allowing developers to push and pull data from multiple tools.
I’m immensely grateful to the incredible participants who generously donated their time and expertise to help with these projects. For anyone interested we also have detailed documentation on discussions during the hackathon.
Bioinformatics Open Source Conference
If you’re a bioinformatics programmers interested in open source coding and helping answer biological questions by improving usability and connectivity of tools, you’re welcome to join the OpenBio and BOSC communities. We’ve created a biological interoperability mailing list for additional discussion. The next BOSC conference is July 19th and 20th in Berlin, Germany as part of the ISMB conference. There will also be another two day Codefest proceeding BOSC on July 17th and 18th. Abstracts for talks at BOSC are due this Friday, April 12th. Looking forward to seeing everyone at future BOSC and coding events.