Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration

My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands-on with researchers running CloudBioLinux and CloudMan during the AWS Genomics Event (talk slides are available).

The culmination of all this feedback is two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:

  • BioCloudCentral — A web service that launches CloudBioLinux and CloudMan clusters on Amazon Web Services hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.

  • A custom Galaxy integrated front-end to next-generation sequencing pipelines. A jQuery UI wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.

Special thanks are due to Enis Afgan for his help building these tools. He provided his boto expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.
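
For the curious, the manual setup BioCloudCentral removes amounts to a handful of boto calls. The sketch below shows the general shape; the region, open ports, key pair name and AMI ID are illustrative assumptions, not the exact values the service uses:

    # Rough sketch of the AWS setup BioCloudCentral automates with boto.
    # The region, ports, key pair name and AMI ID are illustrative assumptions.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1",
                                      aws_access_key_id="YOUR_ACCESS_KEY",
                                      aws_secret_access_key="YOUR_SECRET_KEY")

    # Security group opening ssh plus the CloudMan/Galaxy web interface.
    sg = conn.create_security_group("CloudMan", "Rules for CloudMan clusters")
    for port in (22, 80):
        sg.authorize(ip_protocol="tcp", from_port=port, to_port=port,
                     cidr_ip="0.0.0.0/0")

    # Key pair for ssh access, then launch the CloudBioLinux image.
    key = conn.create_key_pair("cloudman_key_pair")
    conn.run_instances("ami-xxxxxxxx", instance_type="m1.large",
                       key_name=key.name, security_groups=[sg.name])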

This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind-the-scenes details described earlier still apply: the pipeline uses a CloudBioLinux image containing a wide variety of bioinformatics software, and you can use ssh or an NX graphical client to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.

BioCloudCentral — starting a CloudBioLinux instance

To get started, sign up for an Amazon Web Services account. This gives you access to on-demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the Amazon security credentials page.

With these, navigate to BioCloudCentral and fill out the simple entry form. In addition to your access credentials, enter a name to identify the cluster and a password for accessing the CloudMan web interface and the cluster itself via ssh or NX.


Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.

Before leaving the monitoring page, download the pre-formatted user-data file; this lets you later restart the same CloudMan instance directly from the Amazon Web Services console.
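
If you prefer a programmatic route over the AWS console, that saved user-data file can be reused with a few lines of boto. The file name and AMI ID below are placeholders; use the file you downloaded and the current CloudBioLinux image:

    # Relaunch the same CloudMan cluster from the saved user-data file.
    # "cloudman_user_data.txt" and the AMI ID are placeholders.
    import boto.ec2

    with open("cloudman_user_data.txt") as handle:
        user_data = handle.read()

    conn = boto.ec2.connect_to_region("us-east-1",
                                      aws_access_key_id="YOUR_ACCESS_KEY",
                                      aws_secret_access_key="YOUR_SECRET_KEY")
    conn.run_instances("ami-xxxxxxxx", instance_type="m1.large",
                       security_groups=["CloudMan"], key_name="cloudman_key_pair",
                       user_data=user_data)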

CloudMan — managing the cluster

The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:

cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23/

This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.


Galaxy pipeline interface — running the analysis

This Galaxy instance is a fork of the main codebase containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Log in with the demonstration account (user: example@example.com; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:


The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional worker node (Large instance type).
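
Conceptually, the wizard just gathers enough information to write a small run configuration for the backend pipeline. A hand-rolled equivalent might look like the sketch below; the field names only approximate the pipeline's configuration format, so treat them as illustrative rather than a documented schema:

    # Illustrative sketch of the kind of run configuration the Galaxy
    # wizard assembles; field names approximate the pipeline's format.
    import yaml

    run_config = {"details": [
        {"description": "Demo exome sample",             # free-text sample name
         "files": ["sample_1.fastq", "sample_2.fastq"],  # paired-end FASTQ inputs
         "genome_build": "hg19",
         "analysis": "variant",                          # best-practice variant calling
         "algorithm": {"hybrid_target": "target_regions.bed",  # hypothetical keys for the
                       "hybrid_bait": "bait_regions.bed"}}]}    # target and bait BED files

    with open("run_info.yaml", "w") as out:
        yaml.safe_dump(run_config, out, default_flow_style=False)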

Galaxy — retrieving and displaying results

The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don’t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab-separated file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:
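
If you would rather pull results programmatically than click through the web interface, Galaxy’s data library API works with an API key generated under your user preferences. The URL, key and endpoint details below are placeholders and may vary between Galaxy versions:

    # Sketch of listing result files in Galaxy data libraries over the API.
    # GALAXY_URL and API_KEY are placeholders; endpoints may vary by version.
    import requests

    GALAXY_URL = "http://your-cloudman-instance"
    API_KEY = "your-galaxy-api-key"

    libraries = requests.get("%s/api/libraries" % GALAXY_URL,
                             params={"key": API_KEY}).json()
    for library in libraries:
        contents = requests.get("%s/api/libraries/%s/contents" % (GALAXY_URL, library["id"]),
                                params={"key": API_KEY}).json()
        for item in contents:
            print("%s: %s" % (library["name"], item["name"]))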


As a reminder, don’t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the Amazon console.

Analysis pipeline details and extending this work

The backend analysis pipeline is a freely available set of Python modules included on the CloudBioLinux AMI. The pipeline closely follows current best-practice variant detection recommendations from the Broad GATK team.

The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.
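
The plug-in idea is straightforward: pipeline steps look up tools by name, so adding a new aligner or variant caller mostly means registering another function behind the same interface. The following stripped-down sketch illustrates the pattern; it is not the pipeline’s actual code, and the tool command lines are only examples:

    # Toy illustration of the pluggable-tool pattern; not the pipeline's code.
    import subprocess

    def bwa_align(fastq, ref, out_sam):
        """Align with bwa, writing SAM to out_sam ('bwa mem' used for brevity)."""
        with open(out_sam, "w") as out:
            subprocess.check_call(["bwa", "mem", ref, fastq], stdout=out)
        return out_sam

    def novoalign_align(fastq, ref, out_sam):
        """Alternative aligner plugged in behind the same interface."""
        with open(out_sam, "w") as out:
            subprocess.check_call(["novoalign", "-d", ref, "-f", fastq, "-o", "SAM"],
                                  stdout=out)
        return out_sam

    ALIGNERS = {"bwa": bwa_align, "novoalign": novoalign_align}

    def align(fastq, ref, out_sam, aligner="bwa"):
        """Dispatch to whichever aligner the run configuration names."""
        return ALIGNERS[aligner](fastq, ref, out_sam)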

We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.

Written by Brad Chapman

November 29, 2011 at 8:50 pm

Posted in analysis

15 Responses

  1. Brad, as usual, awesome piece of work, thank you for that ! :)

    I have a general “goal” question: What is the added value of the pipeline+custom_galaxy vs constructing standard Galaxy workflow(s) and potentially calling them via the Galaxy API? I thought the bcbio-nextgen custom pipeline goal was more in the “preliminary analysis” arena (a.k.a. getting as many analyses done as possible from the sequencers automatically & quickly).

    An obvious advantage I see is the parallelism that bcbio-nextgen has already built in (multi-core, multi-node runners), but other than that, it feels like replicating workflow systems, which will be hard to integrate upstream, isn’t it? Please correct me if I’m wrong!

    brainstorm

    November 30, 2011 at 4:10 am

    • Roman;
      The goal of this is still push-button best-practice analysis, identical to the automated LIMS work. What this enables beyond that is an easier way to run these pipelines if you have FASTQ files to analyze and are not in an automated environment.

      This is designed as a complement to Galaxy’s existing workflow and tool running capabilities. It’s providing a lower level of abstraction so if you know how to program you can do things that would be more difficult in the graphical workflows: parallelization, cleanup of intermediate files, substituting in alternative aligners and variant callers.

      The difference between this and Galaxy workflows and tools is in the user experience. It gives the user less control beyond top-level configurable parameters by providing a more standardized, best-practice analysis. If you want to tweak a specific parameter in GATK then workflows are the right way; if you want variant calls to get started with, then this tool is more appropriate. The way I envision it being used is to first run a pipeline, examine the results, then explore and tweak with Galaxy tools as desired.
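
      As a toy illustration of the per-sample parallelization point (not the pipeline’s actual code), independent samples map cleanly onto a process pool or onto SGE jobs:

          # Toy sketch: each sample is independent, so fan them out over processes.
          # process_sample is a stub standing in for align + call + cleanup.
          import multiprocessing

          def process_sample(fastq):
              # align, call variants, clean up intermediates for one sample
              return "%s.vcf" % fastq

          if __name__ == "__main__":
              samples = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
              with multiprocessing.Pool(processes=4) as pool:
                  print(pool.map(process_sample, samples))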

      Brad Chapman

      November 30, 2011 at 9:47 am

  2. Hi Brad,
    I like this tutorial very much. The one-click method for creating an on-demand cluster is quite neat. I noticed there was an option for autoscaling. Where can we get more information about these general features? I would probably like to try using the CloudMan API, if one exists, to automatically launch new nodes and add them to the cluster. I wonder how difficult that would be to do with this interface.
    What I think would be nice and beneficial to the community is if we could get some indicative usage costs on Amazon of running e.g. 1 Large instance with the exome pipeline + 20M read pairs. I think this is a familiar number for exome sequencing and it could be run on a published 1000 Genomes sample. Most people are still wary of the cost for such a system and that’s usually a barrier to running it.

    zayedi

    December 9, 2011 at 6:12 am

  3. I forgot to add this in my last comment. Is it possible to customize the underlying AMI that CloudMan boots up? I noticed you used a shared cluster identifier, which I’m guessing is an EBS volume snapshot. I guess I’m thinking about a way to get this booted up, shell into the instance and start adding programs or attaching data, then saving my work.

    zayedi

    December 9, 2011 at 6:18 am

  4. Incredible work Brad and Enis!

    I’m interested in porting CloudMan and BioCloudCentral to OpenStack. Either of you tried this or know anyone who has?

    Thanks!

    Glen

    Glen Otero

    December 12, 2011 at 1:05 pm

    • Glen;
      Awesome. It hasn’t been done yet, but both Enis and Ntino are very interested in both OpenStack and Eucalyptus compatibility. I’ll send everyone an e-mail so you have their contacts; they should be able to provide more background about where the ports are at.

      Brad Chapman

      December 13, 2011 at 6:09 am

  5. Great job Brad.
    I tried the one-click installation and noticed that the /mnt/GalaxyTools directory is completely full, so adding new tools is impossible. Also, the update through the admin console fails since there is only 1.5M left on the tools section.
    I’m OKish with creating a new disk, copying it over and re-mounting etc., but was hoping for an easier way than playing Linux administrator…

    Is there a way this can be done more easily? Expanding the GalaxyTools directory (or others)?

    Thanks

    Thon Deboer

    February 8, 2012 at 2:45 pm

    • Never mind… I missed the part about cloning the instance, and that did not have the same problem.
      It seems that the official CloudMan version has /mnt/GalaxyTools without any free space, not the special CloudBioLinux version.
      I still can’t get CloudBioLinux to work, though: when I run the pipeline, no data shows up in the shared library, and I also can’t seem to access the admin console, even after adding myself as an admin… Has anyone been able to run this?

      Thon Deboer

      February 9, 2012 at 6:00 pm

      • Thon,
        Thanks for trying this out and reporting the issues. I’ve sent a mail to Enis about the default CloudMan Galaxy space issues and cc’ed you. He can hopefully default these to have additional space to prevent those problems.

        For your other issues, are you using the shared instance (the cm-* identifier) from this post? If so, there should be inputs and outputs pre-populated in the Data Libraries. Are you not seeing any of those? Did you login as the example user (example@example.com)?

        For the admin issue, did you restart Galaxy after adding your e-mail to the list of admin_users? I know it needs a reboot after doing that.

        Hope this helps.

        Brad Chapman

        February 9, 2012 at 9:11 pm

  6. Hi,

    I am going into a hackathon using CloudBioLinux and CloudMan on AWS this weekend. We will be running IDDM1 and TCF7L2 tests on the 1000 Genomes data on AWS, using perl scripts to scan for IDDM1 and TCF7L2. We would need to finish the scan in 6 hours.

    I will set up the perl script to sequentially scan through each genome on each instance. If I need 10-100 instances to finish the task, how can I set up CloudMan to split the task across each of these 10-100 instances? Or will I need to develop another script to handle this?

    Any ideas/suggestions are greatly appreciated as we are critically short of time and cannot afford any mistakes.

    Thanks

    Dennis

    dennis yar

    June 21, 2012 at 7:54 pm

    • Dennis;
      That sounds like a fun hackathon. Your approach makes sense as long as your perl script can independently process each of the input samples. The approach I’d take within CloudMan and Galaxy is:

      – Start a CloudMan Galaxy cluster with the number of nodes you’d like to run.
      – Upload input samples into Galaxy.
      – Prepare a Galaxy workflow that incorporates your tool, processing an individual sample.
      – Run the workflow in batch on the input samples: http://dev.list.galaxyproject.org/Looking-for-recommendations-How-to-run-galaxy-workflows-in-batch-td4362836.html
      – CloudMan/SGE will distribute these over your cluster, so watch them run (a bare-bones direct SGE fallback is sketched after this list).
      – Collect results and finalize.
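
      If the Galaxy route proves too heavy for the time you have, a bare-bones fallback is to drive SGE directly with one qsub per input; the script and file names here are hypothetical placeholders for your own perl scanner and data:

          # Hypothetical per-genome SGE submission loop; script and file names
          # are placeholders for your own perl scanner and input data.
          import glob
          import subprocess

          genomes = sorted(glob.glob("/mnt/1000genomes/*.vcf.gz"))
          for idx, genome in enumerate(genomes):
              subprocess.check_call([
                  "qsub", "-cwd", "-b", "y",        # run in cwd, submit command directly
                  "-N", "scan_%03d" % idx,          # per-input job name
                  "perl", "scan_iddm1_tcf7l2.pl", genome])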

      Hope this helps.

      Brad Chapman

      June 21, 2012 at 9:49 pm

      • Hi Brad,

        Thank you.

        Will try it out.

        Dennis

        dennis yar

        June 22, 2012 at 1:49 am

