Blue Collar Bioinformatics

Usage plans for Amazon Web Services research grant

Amazon Web Services provide an excellent distributed computing infrastructure through their Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and associated resources. Essentially, they make compute power and storage available on demand at prices that scale with usage. In the past I’ve written about using EC2 for parallel parsing of large files. Generally, I am a big proponent of distributed computing as a solution to problems ranging from job scaling to improving code availability.

One of the challenges in advocating for EC2 in my day-to-day work is the presence of existing computing resources. We have servers and clusters, but how will we scale for future work? Thankfully, we are able to assess the utility of Amazon services for future scaling through their education and research grants. Our group applied and was accepted for a research grant, which we plan to use to develop and distribute next-generation sequencing analyses both within our group at Mass General Hospital and in the larger community.

Amazon Machine Images (AMIs) provide an opportunity for the open source bioinformatics community to increase code availability. AMIs are essentially pre-built operating systems with installed programs. By creating AMIs and making them available, a programmer can make their code readily accessible to users, who avoid the intricacies of installation and configuration. Add this to available data in the form of public data sets and you have a ready-to-go analysis platform with very little overhead. There is already a large set of available AMIs from which to build.

This idea and our thoughts on moving portions of our next generation sequencing analysis to EC2 are fleshed out further in our research grant application, portions of which are included below. We’d love to collaborate with others moving their bioinformatics work to Amazon resources.

Research Background

One broad area of rapid growth in biology research is deep sequencing (or short read) technology. A single lab investigator can produce hundreds of millions of DNA sequences, equivalent in scale to the entire human genome, in a period of days. This DNA sequencing technology is widely available through both on-site facilities and commercial services. Creating scalable analysis methods is a high priority for the entire bioinformatics community; see http://selab.janelia.org/people/eddys/blog/?p=123 for a presentation nicely summarizing the issues. We propose to address the computational bottlenecks resulting from this huge data volume using distributed AWS resources.

An additional aim of our work is to provide tools to biologists looking to solve their data analysis challenges. When the computational portion of a project becomes a time-limiting step, we can often speed up the cycling between experiment and analysis by providing researchers with ready-to-run scripts or web interfaces. However, this is complicated by high usage on shared computational resources and by heterogeneous platforms requiring time-consuming configuration. Both problems could be ameliorated by scalable EC2 instances with custom configured machine images.

The goals of this grant application are to develop our analysis platform on Amazon’s compute cloud and assess transfer, storage and utilization costs. We currently have internal computational resources ranging from high performance clusters to large memory machines. We believe Amazon’s compute cloud to be an ideal solution as our analysis needs outgrow our current hardware.
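As a rough illustration of the transfer side of that assessment, here is a back-of-the-envelope sketch for estimating upload time. It is not part of the grant application: the function name `upload_hours`, the `efficiency` fudge factor and the example numbers are all assumptions made up for illustration, and real throughput would need to be measured.

```python
def upload_hours(data_gb, bandwidth_mbps, efficiency=0.8):
    """Estimate hours needed to upload data_gb gigabytes.

    bandwidth_mbps is the nominal link speed in megabits per second.
    efficiency is a hypothetical fraction of that speed actually achieved;
    the 0.8 default is an assumption, not a measured value.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # Decimal units: 1 GB = 1000 MB, 1 byte = 8 bits.
    megabits = data_gb * 1000 * 8
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical example: 100 GB of reads over a 100 Mbps connection.
    print("%.1f hours" % upload_hours(100, 100))
```

Comparing an estimate like this against the expected compute time for a task is one simple way to judge whether the task is worth moving off local hardware.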

Benefits to Amazon and the community

Developing software on AWS architecture presents a move towards a standard platform for bioinformatics research. Our group is invested in the open source community and shares both code and analysis tools. One common hindrance to sharing is the heterogeneity of platforms; code is developed on a local cluster and not readily generalizable, hence it is not shared.

By building public machine images along with reusable source code, we make it possible for a wide variety of users to readily use our code and tools. As short read sequencing continues to increase in utility and popularity, a practical ready-to-go platform for analyses will encourage many users to adopt parallelization on cloud resources as a research approach. We have begun initial work with this paradigm by developing parsers for large annotation files using MapReduce on EC2.
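To give a feel for the MapReduce approach to annotation parsing mentioned above, here is a minimal single-machine sketch. It is not the parser referred to in the application: it assumes tab-delimited GFF-style lines with the feature type in the third column, and the function names (`map_chunk`, `reduce_counts`, `count_feature_types`) are invented for this example. The map step is independent per chunk, so the `mapper` argument could be swapped for something that dispatches chunks to separate processes or nodes.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count feature types (third column) in one chunk of lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3:
            counts[parts[2]] += 1
    return counts


def reduce_counts(total, partial):
    """Reduce step: merge one chunk's counts into the running total."""
    total.update(partial)
    return total


def chunked(lines, size):
    """Split a list of lines into consecutive chunks of at most size lines."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def count_feature_types(lines, chunk_size=1000, mapper=map):
    """Run the map step over each chunk, then reduce the partial counts."""
    partials = mapper(map_chunk, chunked(lines, chunk_size))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n",
        "chr1\tsrc\texon\t1\t50\t.\t+\t.\tParent=g1\n",
        "chr1\tsrc\texon\t60\t100\t.\t+\t.\tParent=g1\n",
    ]
    print(dict(count_feature_types(example, chunk_size=2)))
```

Because each chunk is processed without reference to any other, the same structure carries over to a cluster of EC2 instances, where only the small per-chunk counts need to be collected for the reduce step.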

Having the ability to utilize AWS with your support will help us further develop and disseminate analysis templates for the larger biology community, enabling science both at MGH and elsewhere.

Written by Brad Chapman

September 7, 2009 at 7:42 pm

7 Responses

  1. My team has also been looking at AWS as a way to provide a scalable, sharable platform for analysis using public tools and leveraging the available public datasets hosted with Amazon. We’ve faced questions related to the time and opportunity cost of uploading the data to the cloud for analysis and collaboration. Is this an issue for your team? If so, how are you approaching this challenge as you think about solutions for future work?

    Cheryl

    September 8, 2009 at 8:27 pm

    • Cheryl;
      Upload is definitely a potential bottleneck that we’ll have to learn to work with. Right now we don’t have any very specific plans. The general idea of our grant work is to determine which tasks are better suited to local hardware (processing large files, like image files) and which can move to cloud infrastructure (things like parallel downstream analysis of read data). Where the sweet spot falls between being too time consuming to upload and set up on the cloud and being cloud friendly is something we hope to learn with a little trial and error.

      Brad

      Brad Chapman

      September 9, 2009 at 7:43 am

  2. Have you taken an interest in Eucalyptus (http://open.eucalyptus.com/)? As I understand it, it allows you to set up an AWS-compatible machine image on your local cluster and run it there, and move the image to AWS with minimal hassle when a task requires more juice.

    Eric Talevich

    September 19, 2009 at 10:20 am

    • Hi Eric;
      Eucalyptus is great, although I haven’t had the chance to play with it. I know some other researchers in the Boston area are going the route you describe — starting with a cluster built on Eucalyptus and then moving things to AWS as needed. It’s a great way to go and would give more control. Logistically, our cluster is a larger shared resource so we don’t have a lot of direct control over what management software is used. My hope is to demonstrate utility with AWS and move back the opposite way to encourage adoption of something like Eucalyptus.

      Brad

      Brad Chapman

      September 21, 2009 at 6:37 am

  3. Not trying to start a distro war, but does Ubuntu’s server cloud fit into your code testing as well? I have been egging ABI on to make their software Ubuntu compatible, because Ubuntu has the cloud computing service and is more user friendly compared with CentOS. But so far there has been no response from them.

    Kevin

    March 4, 2010 at 11:46 am

    • Kevin;
      I run Ubuntu on my notebook, and think it’s a good target for general AMIs aimed at end users. I don’t have much experience with ABI tools, but getting them running on Ubuntu shouldn’t be too bad if they work fine on other Linux flavors.

      Brad Chapman

      March 5, 2010 at 7:51 am

