Blue Collar Bioinformatics

CloudBioLinux: progress on bioinformatics cloud images and data

with 4 comments

My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with:

  • A permanent web site at cloudbiolinux.org
  • Additional software and genomic data
  • New user documentation
  • A community coding session: Codefest 2010

New software and data

The most exciting change has been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom configured servers. A great group of contributors has put together a large set of programs and libraries; the configuration files have all the details on installed programs as well as libraries for Python, Perl, Ruby, and R. Another addition is support for non-packaged programs, which covers software not yet neatly wrapped in a package manager or library-specific install system: next-gen software packages like Picard, GATK and Bowtie are installed through custom scripts.
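
To give a feel for how a configuration-driven install can work, here is a minimal sketch in Python. It is not the project's actual code: the `CONFIG` layout, the `INSTALLERS` table, the example package names, and the `install_commands` helper are all assumptions made up for illustration, and the real configuration files may be organized quite differently. The idea is simply that each package list maps onto the install command of its own package manager, and anything without a known manager needs a custom script.

```python
# Hypothetical configuration, for illustration only. The real CloudBioLinux
# configuration files may use a different layout and different package names.
CONFIG = {
    "apt": ["samtools", "emboss"],
    "python": ["biopython"],
    "perl": ["Bio::Perl"],
    "ruby": ["bio"],
    "r": ["ggplot2"],
}

# One install command template per package manager (assumed choices).
INSTALLERS = {
    "apt": "sudo apt-get install -y {name}",
    "python": "pip install {name}",
    "perl": "cpanm {name}",
    "ruby": "gem install {name}",
    "r": "Rscript -e 'install.packages(\"{name}\", "
         "repos=\"https://cloud.r-project.org\")'",
}


def install_commands(config):
    """Return the shell commands needed to install everything in `config`."""
    unknown = set(config) - set(INSTALLERS)
    if unknown:
        # Software without a package manager needs its own custom script.
        raise ValueError("no installer for: " + ", ".join(sorted(unknown)))
    commands = []
    # Walk the installers in a fixed order so system packages come first.
    for manager, template in INSTALLERS.items():
        for name in config.get(manager, []):
            commands.append(template.format(name=name))
    return commands


if __name__ == "__main__":
    for command in install_commands(CONFIG):
        print(command)
```

Keeping the package lists as plain data means a contributor can add a library by editing one line, without touching the install logic.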

To improve accessibility for developers who prefer a desktop experience, a FreeNX server was integrated with the provided images. Tim Booth from the NEBC Bio-Linux team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.

In addition to the software image, a public data volume is now available that contains:

  • Genome sequences pre-indexed for search with next-gen aligners like Bowtie, Novoalign, and BWA.
  • LiftOver files for mapping between sequence coordinates.
  • UniRef protein databases, indexed for searching with BLAST+.

Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of Galaxy, the data volume plugs directly into Galaxy’s ready-to-use cloud image. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.
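
The first two steps, starting the image and attaching the data, can be sketched in Python. This is an illustration under stated assumptions rather than the project's own tooling: it assumes the data volume is shared as a public EBS snapshot, it expects an object that behaves like a boto3 EC2 client, and the `launch_with_genome_data` helper and every ID passed to it are hypothetical placeholders. Look up the current image and snapshot IDs on the CloudBioLinux website.

```python
def launch_with_genome_data(ec2, image_id, snapshot_id, zone,
                            instance_type, device="/dev/sdf"):
    """Start one instance and attach a volume built from a data snapshot.

    `ec2` is assumed to behave like a boto3 EC2 client. Returns the
    (instance_id, volume_id) pair so the caller can clean up later.
    """
    # 1. Start a single instance from the software image.
    reservation = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_id = reservation["Instances"][0]["InstanceId"]

    # 2. Create a volume from the data snapshot. A volume can only be
    #    attached to an instance in the same availability zone.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]

    # 3. Wait until both the instance and the volume are ready.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # 4. Attach the data volume to the running instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id,
                      Device=device)
    return instance_id, volume_id
```

With boto3 installed and credentials configured, the client would come from `boto3.client("ec2")`. After the call returns, mount the device on the instance, copy your fastq file over with scp, and start the alignment.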

The data volume preparation is fully automated via a Fabric install script, similar to the software install script. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.
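
As a rough picture of what automated genome preparation involves, the sketch below builds the shell commands that index one genome for Bowtie and BWA. It is not the project's install script: the `index_commands` helper and the directory layout (`seq/`, `bowtie/`, `bwa/` under one genome directory) are assumptions for illustration. In a Fabric script, each resulting string would be handed to Fabric's `run()` to execute on the remote machine.

```python
import posixpath


def index_commands(genome_dir, build):
    """Return shell commands that index one genome build for two aligners.

    The layout is hypothetical: the FASTA file is expected at
    <genome_dir>/seq/<build>.fa and each aligner gets its own directory.
    """
    fasta = posixpath.join(genome_dir, "seq", build + ".fa")
    bowtie_dir = posixpath.join(genome_dir, "bowtie")
    bwa_dir = posixpath.join(genome_dir, "bwa")
    return [
        # Bowtie: bowtie-build <reference fasta> <index basename>
        "mkdir -p " + bowtie_dir,
        "bowtie-build {0} {1}".format(fasta, posixpath.join(bowtie_dir, build)),
        # BWA: the bwtsw algorithm is the one intended for large genomes
        "mkdir -p " + bwa_dir,
        "bwa index -a bwtsw -p {0} {1}".format(
            posixpath.join(bwa_dir, build), fasta),
    ]


if __name__ == "__main__":
    for command in index_commands("/mnt/biodata/genomes/hg19", "hg19"):
        print(command)
```

Because indexing a large genome can take hours, doing it once and publishing the result on a shared volume is what removes the setup cost for everyone else.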

Documentation and presentations

The software and data volumes are only as good as the documentation that helps people use them.

Community: Codefest 2010

The CloudBioLinux community had a chance to work together for two days in July at Codefest 2010. Held in conjunction with the Bioinformatics Open Source Conference (BOSC) in Boston, this was a free-to-attend coding session hosted at Harvard School of Public Health and Massachusetts General Hospital. Over 30 developers donated two days of their time to working on CloudBioLinux and other bioinformatics open source projects.

Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.

It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We’re planning another session next year in Vienna, again just before BOSC and ISMB 2011; and again, everyone is welcome.

Summary

Go to the CloudBioLinux website for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With Amazon’s new micro instances you can start analyzing data for only a few cents an hour. It’s an easy way to explore whether cloud resources will help with computational demands in your work. We’re very interested in feedback and happy to have other developers helping out; please get in touch on the CloudBioLinux mailing list.

Written by Brad Chapman

October 13, 2010 at 6:19 pm

Posted in OpenBio


4 Responses


  1. How difficult would it be to create other virtual machine image formats such as VirtualBox with this script?

    Thanks for using FreeNX, it makes the graphical remote desktop quite responsive.

    Mike Chelen

    October 14, 2010 at 11:15 am

    • Mike;
      There is nothing cloud specific about the build script; it will work fine on any Ubuntu machine, from a local server to other virtual machines. Roman Valls Guimera is working on Eucalyptus support currently, and we’d definitely like to see other approaches. Let us know if you have luck with VirtualBox.

      Glad you like FreeNX. Tim is the man to thank for the suggestion to use it and all the integration work.

      Brad Chapman

      October 14, 2010 at 6:36 pm

  2. Thanks for posting this. You put great stuff on this blog and I’m pleased that you’re still writing it. Some colleagues and I were just talking about moving some of our analyses to EC2. I’m emailing the link momentarily.

    Nick Crawford

    October 15, 2010 at 5:39 pm

  3. […] CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at cloudbiolinux.org; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 … […]

