Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Automated build environment for Bioinformatics cloud images

Amazon Web Services provides scalable, on-demand computational resources through its Elastic Compute Cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I’m happy to describe an initial step in that direction: an automated build system, driven by easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) integrating packages from several existing efforts. The hope is to consolidate the community’s open source work around a single, continuously improving machine image.

This image incorporates software from several existing AMIs:

  • JCVI Cloud BioLinux — JCVI’s work porting Bio-Linux to the cloud.
  • bioperl-max — Fortinbras’ package of BioPerl and associated informatics tools.
  • MachetEC2 — An InfoChimps image loaded with data mining software.

Each of these libraries inspired different aspects of developing this image and associated infrastructure, and I’m extremely grateful to the authors for their code, documentation and discussions.

The current AMI is available for loading on EC2 — search for ‘CloudBioLinux’ in the AWS console or go to the CloudBioLinux project page for the latest AMIs. The automated scripts and the configuration files describing the included packages are available in a GitHub repository.

Contributions encouraged

This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up-to-date libraries and programs they have installed on their work machines. If their favorite package is missing, it should be quick and easy to add, making the improvement available to future developers.

Achieving these goals requires help and contributions from other programmers utilizing the cloud — everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I’d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:

  • Programmers: help expand the configuration files for your areas of interest:
    • Perl CPAN support and libraries
    • Ruby gems
    • Java libraries
    • Haskell hackage support and libraries
    • Erlang libraries
    • Bioinformatics areas of specialization:
      • Next-gen sequencing
      • Structural biology
      • Parallelized algorithms
    • Much more… Let us know what you are interested in.
  • Documentation experts: provide cookbook-style instructions to help others get started.
  • Porting specialists: The automation infrastructure depends on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but also make those programs more widely available.
  • Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We’d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
  • Testers: Check that this runs on open source Eucalyptus clouds, additional Linux distributions, and other cloud deployments.

If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.

Infrastructure overview

In addition to supplying an image for downstream use, this implementation was designed to be easily extensible. Inspired by the MachetEC2 project, the build lists packages to install in a set of easy-to-edit configuration files in YAML syntax. There are three different configuration file types, sketched briefly after this list:

  • main.yaml — The high-level configuration file defining which groups of packages to install. This lets a user build a custom image simply by commenting out the groups that are not of interest.
  • packages.yaml — Defines the Debian/Ubuntu packages to be installed. This leans heavily on the work of the DebianMed and Bio-Linux communities, as well as all of the hard-working package maintainers for the distributions. If it exists in packaged form, you can list it here.
  • python-libs.yaml, r-libs.yaml — These take advantage of language-specific ways of installing libraries. Currently implemented are Python library installation from the Python Package Index and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.
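
As a rough illustration of the layout, here are two hypothetical excerpts; the real files live in the config directory of the GitHub repository, and the package names below are just examples of existing Debian/Ubuntu packages:

    # main.yaml -- comment out a group to leave it off a custom image
    packages:
      - bio
      - python
      - r
      #- ruby

    # packages.yaml -- Debian/Ubuntu packages organized by group
    bio:
      - emboss
      - ncbi-blast+
    python:
      - python-numpy
      - python-biopython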

The Fabric remote automated deployment tool is used to build AMIs from these configuration files. Written in Python, the fabfile automates the process of installing packages on the cloud machine.
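
To give a flavor of how this works, below is a minimal sketch in the spirit of the fabfile, not the actual code: it reads the grouped package names from a YAML file and installs each one with apt on the remote machine. The function name, login and file path are illustrative:

    import yaml
    from fabric.api import env, sudo

    env.user = "ubuntu"  # placeholder login for the target cloud instance

    def install_packages(config_file="config/packages.yaml"):
        """Install every apt package listed under each group in the config."""
        with open(config_file) as in_handle:
            config = yaml.safe_load(in_handle)
        for group, packages in config.items():
            for package in packages:
                sudo("apt-get -y install %s" % package)

A task like this runs against a remote host with fab -H <hostname> install_packages, which is how an edited configuration gets turned into an updated image.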

We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding new functionality.

If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it’ll be two enjoyable days of cloud informatics development. I’m looking forward to hearing from other developers who are interested in building and maintaining an easy to use, up to date, machine image that can help make biological computation more accessible to the community.

Written by Brad Chapman

May 8, 2010 at 9:35 am

Posted in OpenBio

15 Responses

  1. Before you go through a really big development effort, have you at least looked at Puppet and Chef? Both are fairly easy to understand and already have large user and developer communities.

    http://www.puppetlabs.com/
    http://www.opscode.com/chef

    If they are indeed too complex for your target audience, I suggest that you at least base your build back-end on one of these tools.

    Angel

    May 11, 2010 at 7:53 am

    • Angel;
      Definitely. This uses Fabric, which is an analogous deployment system in Python. The only reason to choose Fabric over Puppet or Chef is my familiarity with Python; a Ruby hacker could re-use the sample configuration structure with those systems.

      The development effort here focuses on building up a good default configuration of useful packages and libraries. I’m all about leveraging these existing systems instead of re-writing them.

      Brad Chapman

      May 11, 2010 at 8:15 am

      • I’m not 100% sure about Chef, but from what I gather Puppet is substantially more than simply a deployment package. Ripped from their site:
        “The user does not have to rely on error prone state transitions — Puppet will make your system look as desired, without needing to specify the commands needed to put it in that state relative to the current state.”

        http://www.puppetlabs.com/puppet/introduction/

        That being said, I’m stoked to see this kind of project being started. I’ve had similar ideas for about a year now but never got around to making anything happen. Also, as a Python guy, I’m happy to see things going the way they are.

        Mike Sandford

        June 3, 2010 at 12:12 pm

        • Mike;
          Chef and Puppet definitely do a lot more, although I wasn’t able to pick out any parts outside of deployment that could be reused for this particular project. This may be due to my lack of Ruby skills; I’d be happy to hear pointers from you or other folks about specific functionality that could be adopted here.

          Glad to hear this project matches up with similar things you’ve been thinking about. The python code is pretty minimal, so hopefully it’s easy to dig into. I’d be happy to have you on board; let us know if you have any questions.

          Brad Chapman

          June 4, 2010 at 6:58 am

  2. Great stuff, this is a really useful initiative. I will try to remember it next time I need to package up something which requires a bunch of dependencies and would benefit from running on EC2. I have a couple of things in mind.

    Nick Loman

    May 30, 2010 at 1:16 pm

    • Nick;
      Sounds great. Additional package suggestions would be very welcome. Let me know what I can do to help.

      Brad Chapman

      May 31, 2010 at 6:44 am

  3. Hi Brad,

    We are indexing your Amazon AMIs at http://thecloudmarket.com/owner/678711657553--brad-chapman

    Feel free to claim your account and update the information about you and your AMIs.

    Cheers

    TheCloudMarket

    June 23, 2010 at 4:47 am

  4. Brad:
    Fabric is something different, and possibly lower-level (as applied to “language”, not to “quality”), than Puppet or Chef. The Python equivalent of Puppet would be BCFG2 (http://trac.mcs.anl.gov/projects/bcfg2/).

    Ivan Rossi

    June 25, 2010 at 5:54 am

    • Ivan;
      Thanks for the pointer to BCFG2. I’ve been spending some more time working with Chef and agree with you on the coverage: Fabric handles remote deployment issues, while Chef builds on this to provide utilities to deal with heterogeneous systems and configuration management.

      Brad Chapman

      June 28, 2010 at 5:34 am

  5. Brad,
    As an end-user rather than a developer, I am happy to see this kind of effort to bring cloud computing to the unwashed masses of biologists. One question – why deploy a 32-bit image rather than 64-bit? My motivation for moving to the cloud is that a single lane of Illumina sequencing data overwhelms my Linux box with 8 GB of RAM, so a 32-bit image in the cloud won’t do me much good. I am looking for information on setting up a 64-bit image with at least 16 GB of RAM in the cloud, and would welcome any pointers.

    Ross Whetten

    August 31, 2010 at 11:31 am

    • Ross;
      Absolutely agreed that 64-bit images are the way to go for large-scale analyses. Since this post, we’ve moved things forward quite a bit and have much-improved 64-bit and 32-bit images. The main site with all the AMI details is:

      http://cloudbiolinux.org/

      Amazon extra-large instances with the 64-bit AMI sound like they’ll work for your needs. Let us know if you run into any issues or have other feedback.
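
      If it helps, here’s one way to start such an instance from Python with the boto library; the AMI id and keypair name below are placeholders, so grab the current 64-bit AMI id from the site above:

          import boto

          # credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
          conn = boto.connect_ec2()
          conn.run_instances(
              "ami-xxxxxxxx",             # placeholder: current 64-bit CloudBioLinux AMI
              instance_type="m2.xlarge",  # high-memory extra large: 17.1 GB of RAM
              key_name="your-keypair")    # placeholder keypair name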

      I need to write up a new post linking to all the updated work so it’s easier to find. Thanks much for the thoughts.

      Brad Chapman

      August 31, 2010 at 12:36 pm

      • Brad,
        Thanks for the quick reply – the CloudBioLinux site seems to have just what I needed.

        Ross Whetten

        September 1, 2010 at 8:50 pm

  6. Hi Brad,
    Nice work. I recently started following this project. It seems to me that at some point you’ll want to branch out into various focused versions, for instance separate builds for structural biology, NGS, etc. Is there any plan for that kind of branching? Also, I am very interested in having some systems biology focused add-ons in the current build system, such as tools for modelling and simulation, and I am keen to contribute in this direction. In addition, I have a request cum suggestion: could you host a separate repository for this project on GitHub? At the moment I have to fork all your projects, which is nice, but you may want to keep things separate. As someone coming from BitBucket, maybe I am missing something here?

    Abhishek

    September 3, 2010 at 6:30 am

  7. Abhishek;
    It would be great to have some systems biology libraries in the image; glad to have you interested and let us know what we can do to help.

    The configuration format supports a high level organization by types of packages and programming libraries:

    https://github.com/chapmanb/cloudbiolinux/blob/master/config/main.yaml

    so that it’s easy to create custom images with only a subset of the full functionality. Right now the plan is to maintain a single central image with everything, but the option is there to have separate distributions in the future if folks are interested.

    It would be a good idea to have the code in a separate repository. This grew out of a small project, hence being embedded in the bigger catch-all repository. Now we’ve got the problem of having links to the current code base on the web, but maybe we should think about moving it sooner rather than later.

    Brad Chapman

    September 3, 2010 at 7:35 am

  8. You describe this problem well. I hope I will help soon…

    Rohn

    October 20, 2010 at 8:24 am

