Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration

My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands-on with researchers running CloudBioLinux and CloudMan during the AWS Genomics Event (talk slides are available).

The culmination of all this feedback is two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:

  • BioCloudCentral — A web service that launches CloudBioLinux and CloudMan clusters on Amazon Web Services hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.

  • A custom Galaxy integrated front-end to next-generation sequencing pipelines. A jQuery UI wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.

Special thanks are due to Enis Afgan for his help building these tools. He provided his boto expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.
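
For the curious, the manual setup BioCloudCentral removes amounts to a handful of boto calls. The sketch below shows the general shape; the region, open ports, key pair name and AMI ID are illustrative assumptions, not the exact values the service uses:

    # Rough sketch of the AWS setup BioCloudCentral automates with boto.
    # The region, ports, key pair name and AMI ID are illustrative assumptions.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1",
                                      aws_access_key_id="YOUR_ACCESS_KEY",
                                      aws_secret_access_key="YOUR_SECRET_KEY")

    # Security group opening ssh plus the CloudMan/Galaxy web interface.
    sg = conn.create_security_group("CloudMan", "Rules for CloudMan clusters")
    for port in (22, 80):
        sg.authorize(ip_protocol="tcp", from_port=port, to_port=port,
                     cidr_ip="0.0.0.0/0")

    # Key pair for ssh access, then launch the CloudBioLinux image.
    key = conn.create_key_pair("cloudman_key_pair")
    conn.run_instances("ami-xxxxxxxx", instance_type="m1.large",
                       key_name=key.name, security_groups=[sg.name])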

This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind-the-scenes details described earlier still apply: the pipeline uses a CloudBioLinux image containing a wide variety of bioinformatics software, and you can use ssh or an NX graphical client to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.

BioCloudCentral — starting a CloudBioLinux instance

To get started, sign up for an Amazon Web Services account. This gives you access to on-demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the Amazon security credentials page.

With these, navigate to BioCloudCentral and fill out the simple entry form. In addition to your access credentials, enter a name to identify the cluster and a password for accessing the CloudMan web interface and the cluster itself via ssh or NX.


Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.

Before leaving the monitoring page, download the pre-formatted user-data file; this lets you later restart the same CloudMan instance directly from the Amazon Web Services console.
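
If you prefer a programmatic route over the AWS console, that saved user-data file can be reused with a few lines of boto. The file name and AMI ID below are placeholders; use the file you downloaded and the current CloudBioLinux image:

    # Relaunch the same CloudMan cluster from the saved user-data file.
    # "cloudman_user_data.txt" and the AMI ID are placeholders.
    import boto.ec2

    with open("cloudman_user_data.txt") as handle:
        user_data = handle.read()

    conn = boto.ec2.connect_to_region("us-east-1",
                                      aws_access_key_id="YOUR_ACCESS_KEY",
                                      aws_secret_access_key="YOUR_SECRET_KEY")
    conn.run_instances("ami-xxxxxxxx", instance_type="m1.large",
                       security_groups=["CloudMan"], key_name="cloudman_key_pair",
                       user_data=user_data)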

CloudMan — managing the cluster

The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:

cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23/

This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.


Galaxy pipeline interface — running the analysis

This Galaxy instance is a fork of the main codebase containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Log in with the demonstration account (user: example@example.com; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:


The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional worker node (Large instance type).
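
Conceptually, the wizard just gathers enough information to write a small run configuration for the backend pipeline. A hand-rolled equivalent might look like the sketch below; the field names only approximate the pipeline's configuration format, so treat them as illustrative rather than a documented schema:

    # Illustrative sketch of the kind of run configuration the Galaxy
    # wizard assembles; field names approximate the pipeline's format.
    import yaml

    run_config = {"details": [
        {"description": "Demo exome sample",             # free-text sample name
         "files": ["sample_1.fastq", "sample_2.fastq"],  # paired-end FASTQ inputs
         "genome_build": "hg19",
         "analysis": "variant",                          # best-practice variant calling
         "algorithm": {"hybrid_target": "target_regions.bed",  # hypothetical keys for the
                       "hybrid_bait": "bait_regions.bed"}}]}    # target and bait BED files

    with open("run_info.yaml", "w") as out:
        yaml.safe_dump(run_config, out, default_flow_style=False)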

Galaxy — retrieving and displaying results

The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don’t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab-separated file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:
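
If you would rather pull results programmatically than click through the web interface, Galaxy’s data library API works with an API key generated under your user preferences. The URL, key and endpoint details below are placeholders and may vary between Galaxy versions:

    # Sketch of listing result files in Galaxy data libraries over the API.
    # GALAXY_URL and API_KEY are placeholders; endpoints may vary by version.
    import requests

    GALAXY_URL = "http://your-cloudman-instance"
    API_KEY = "your-galaxy-api-key"

    libraries = requests.get("%s/api/libraries" % GALAXY_URL,
                             params={"key": API_KEY}).json()
    for library in libraries:
        contents = requests.get("%s/api/libraries/%s/contents" % (GALAXY_URL, library["id"]),
                                params={"key": API_KEY}).json()
        for item in contents:
            print("%s: %s" % (library["name"], item["name"]))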


As a reminder, don’t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the Amazon console.

Analysis pipeline details and extending this work

The backend analysis pipeline is a freely available set of Python modules included on the CloudBioLinux AMI. The pipeline closely follows current best-practice variant detection recommendations from the Broad GATK team.

The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.
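
The plug-in idea is straightforward: pipeline steps look up tools by name, so adding a new aligner or variant caller mostly means registering another function behind the same interface. The following stripped-down sketch illustrates the pattern; it is not the pipeline’s actual code, and the tool command lines are only examples:

    # Toy illustration of the pluggable-tool pattern; not the pipeline's code.
    import subprocess

    def bwa_align(fastq, ref, out_sam):
        """Align with bwa, writing SAM to out_sam ('bwa mem' used for brevity)."""
        with open(out_sam, "w") as out:
            subprocess.check_call(["bwa", "mem", ref, fastq], stdout=out)
        return out_sam

    def novoalign_align(fastq, ref, out_sam):
        """Alternative aligner plugged in behind the same interface."""
        with open(out_sam, "w") as out:
            subprocess.check_call(["novoalign", "-d", ref, "-f", fastq, "-o", "SAM"],
                                  stdout=out)
        return out_sam

    ALIGNERS = {"bwa": bwa_align, "novoalign": novoalign_align}

    def align(fastq, ref, out_sam, aligner="bwa"):
        """Dispatch to whichever aligner the run configuration names."""
        return ALIGNERS[aligner](fastq, ref, out_sam)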

We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.

Written by Brad Chapman

November 29, 2011 at 8:50 pm

Posted in analysis

15 Responses

  1. Brad, as usual, awesome piece of work, thank you for that ! :)

    I have a general “goal” question: What is the added value of the pipeline+custom_galaxy vs constructing standard Galaxy workflow(s) and potentially calling them via the Galaxy API? I thought the bcbio-nextgen custom pipeline goal was more in the “preliminary analysis” arena (a.k.a. getting as many analyses done as possible from the sequencers automatically & quickly).

    An obvious advantage I see is the parallelism that bcbio-nextgen has already built in (multi-core, multi-node runners), but other than that, it feels like replicating workflow systems, which will be hard to integrate upstream, isn’t it? Please correct me if I’m wrong!

    brainstorm

    November 30, 2011 at 4:10 am

    • Roman;
      The goal of this is still push-button best-practice analysis, identical to the automated LIMS work. What this enables beyond that is an easier way to run these pipelines if you have FASTQ files to analyze and are not in an automated environment.

      This is designed as a complement to Galaxy’s existing workflow and tool running capabilities. It’s providing a lower level of abstraction so if you know how to program you can do things that would be more difficult in the graphical workflows: parallelization, cleanup of intermediate files, substituting in alternative aligners and variant callers.

      The difference between this and Galaxy workflows and tools is in the user experience. It gives the user less control beyond top-level configurable parameters by providing a more standardized, best-practice analysis. If you want to tweak a specific parameter in GATK then workflows are the right way; if you want variant calls to get started with, then this tool is more appropriate. The way I envision it being used is to first run a pipeline, examine the results, then explore and tweak with Galaxy tools as desired.
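
      As a toy illustration of the per-sample parallelization point (not the pipeline’s actual code), independent samples map cleanly onto a process pool or onto SGE jobs:

          # Toy sketch: each sample is independent, so fan them out over processes.
          # process_sample is a stub standing in for align + call + cleanup.
          import multiprocessing

          def process_sample(fastq):
              # align, call variants, clean up intermediates for one sample
              return "%s.vcf" % fastq

          if __name__ == "__main__":
              samples = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
              with multiprocessing.Pool(processes=4) as pool:
                  print(pool.map(process_sample, samples))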

      Brad Chapman

      November 30, 2011 at 9:47 am

  2. Hi Brad,
    I like this tutorial very much. The one-click method for creating an on-demand cluster is quite neat. I noticed there was an option for autoscaling. Where can we get more information about these general features? I would probably like to try using the CloudMan API, if one exists, to automatically launch new nodes and add them to the cluster. I wonder how difficult that would be to do with this interface.
    What I think would be nice and beneficial to the community is if we could get some indicative usage costs on Amazon of running e.g. 1 Large instance with the exome pipeline + 20M read pairs. I think this is a familiar number for exome sequencing and it could be run on a published 1000 Genomes sample. Most people are still wary of the cost for such a system and that’s usually a barrier to running it.

    zayedi

    December 9, 2011 at 6:12 am

  3. I forgot to add this in my last comment. Is it possible to customize the underlying AMI that CloudMan boots up? I noticed you used a shared cluster identifier, which I’m guessing is an EBS volume snapshot. I guess I’m thinking about a way to get this booted up, shell into the instance and start adding programs or attaching data, then saving my work.

    zayedi

    December 9, 2011 at 6:18 am

  4. Incredible work Brad and Enis!

    I’m interested in porting CloudMan and BioCloudCentral to OpenStack. Either of you tried this or know anyone who has?

    Thanks!

    Glen

    Glen Otero

    December 12, 2011 at 1:05 pm

    • Glen;
      Awesome. It hasn’t been done yet, but both Enis and Ntino are very interested in both OpenStack and Eucalyptus compatibility. I’ll send everyone an e-mail so you have their contacts; they should be able to provide more background about where the ports are at.

      Brad Chapman

      December 13, 2011 at 6:09 am

  5. Great job Brad.
    I tried the one-click installation and noticed that the /mnt/GalaxyTools directory is completely full, so adding new tools is impossible. Also, the update through the admin console fails since there is only 1.5M left on the tools section.
    I’m OKish with creating a new disk, copying it over and re-mounting etc., but was hoping for an easier way than playing Linux administrator…

    Is there a way this can be done more easily? Expanding the GalaxyTools directory (or others)?

    Thanks

    Thon Deboer

    February 8, 2012 at 2:45 pm

    • Never mind… I missed the part about cloning the instance, and that did not have the same problem.
      It seems that the official CloudMan version has /mnt/GalaxyTools without any free space, not the special CloudBioLinux version.
      I still can’t get CloudBioLinux to work, though: when I run the pipeline, no data shows up in the shared library, and I also can’t seem to access the admin console, even after adding myself as an admin… Has anyone been able to run this?

      Thon Deboer

      February 9, 2012 at 6:00 pm

      • Thon,
        Thanks for trying this out and reporting the issues. I’ve sent a mail to Enis about the default CloudMan Galaxy space issues and cc’ed you. He can hopefully default these to have additional space to prevent those problems.

        For your other issues, are you using the shared instance (the cm-* identifier) from this post? If so, there should be inputs and outputs pre-populated in the Data Libraries. Are you not seeing any of those? Did you login as the example user (example@example.com)?

        For the admin issue, did you restart Galaxy after adding your e-mail to the list of admin_users? I know it needs a reboot after doing that.

        Hope this helps.

        Brad Chapman

        February 9, 2012 at 9:11 pm

  6. Hi,

    I am going into a hackathon using CloudBioLinux and CloudMan on AWS this weekend. We will be running IDDM1 and TCF7L2 tests on the 1000 Genomes data on AWS, using perl scripts to scan for IDDM1 and TCF7L2. We would need to finish the scan in 6 hours.

    I will set up the perl script to sequentially scan through each genome on each instance. If I need 10-100 instances to finish the task, how can I set up CloudMan to split the task across each of these 10-100 instances? Or will I need to develop another script to handle this?

    Any ideas/suggestions are greatly appreciated as we are critically short of time and cannot afford any mistakes.

    Thanks

    Dennis

    dennis yar

    June 21, 2012 at 7:54 pm

    • Dennis;
      That sounds like a fun hackathon. Your approach makes sense as long as your perl script can independently process each of the input samples. The approach I’d take within CloudMan and Galaxy is:

      – Start a CloudMan Galaxy cluster with the number of nodes you’d like to run.
      – Upload input samples into Galaxy.
      – Prepare a Galaxy workflow that incorporates your tool, processing an individual sample.
      – Run the workflow in batch on the input samples: http://dev.list.galaxyproject.org/Looking-for-recommendations-How-to-run-galaxy-workflows-in-batch-td4362836.html
      – CloudMan/SGE will distribute these over your cluster, so watch them run (a bare-bones direct SGE fallback is sketched after this list).
      – Collect results and finalize.
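
      If the Galaxy route proves too heavy for the time you have, a bare-bones fallback is to drive SGE directly with one qsub per input; the script and file names here are hypothetical placeholders for your own perl scanner and data:

          # Hypothetical per-genome SGE submission loop; script and file names
          # are placeholders for your own perl scanner and input data.
          import glob
          import subprocess

          genomes = sorted(glob.glob("/mnt/1000genomes/*.vcf.gz"))
          for idx, genome in enumerate(genomes):
              subprocess.check_call([
                  "qsub", "-cwd", "-b", "y",        # run in cwd, submit command directly
                  "-N", "scan_%03d" % idx,          # per-input job name
                  "perl", "scan_iddm1_tcf7l2.pl", genome])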

      Hope this helps.

      Brad Chapman

      June 21, 2012 at 9:49 pm

      • Hi Brad,

        Thank you.

        Will try it out.

        Dennis

        dennis yar

        June 22, 2012 at 1:49 am

