Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Distributed exome analysis pipeline with CloudBioLinux and CloudMan


A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.

Establishing community resources that meet the challenges of building these pipelines lets bioinformatics programmers share the burden of building large-scale systems. Two open-source efforts that aim to provide this type of architecture are:

  • CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.

  • CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.

Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:

  • The underlying analysis software is from CloudBioLinux.
  • CloudMan provides an SGE cluster managed via a web front end.
  • RabbitMQ is used for communication between cluster nodes.
  • An automated pipeline, written in Python, organizes parallel processing across the cluster.

Below are instructions for starting a cluster on Amazon EC2 resources to run an exome sequencing pipeline that processes FASTQ sequencing reads, producing fully annotated variant calls.

Start cluster with CloudBioLinux and CloudMan

Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux-specific CloudMan server:

Once this is booted, proceed to the CloudMan web interface on the server and start up an instance from this shared identifier:

cm-b53c6f1223f966914df347687f6fc818/shared/2012-07-23--19-23

This screencast shows all of the details, including starting an additional node on the SGE cluster:


Configure AMQP messaging

Edit: The AMQP messaging steps have now been fully automated, so the configuration steps in this section are no longer required. Skip down to the ‘Run analysis’ section to start processing the data immediately.

With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:

$ ssh -i your-keypair ubuntu@ec2-50-19-177-134.compute-1.amazonaws.com
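
If you want to double check the internal hostname from the head node itself, the EC2 instance metadata service reports it directly. This is only a convenience check, not part of the pipeline; the metadata URL is the standard EC2 endpoint and the value printed is what goes into the configuration file in the next step.

# Confirm the head node's internal hostname via the EC2 metadata service.
# Standard library only; works with Python 2 or 3.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

METADATA_URL = "http://169.254.169.254/latest/meta-data/local-hostname"
print(urlopen(METADATA_URL).read().decode().strip())
# e.g. ip-10-125-10-182.ec2.internal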

Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:

[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester

Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:

$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...
...done.
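
As a sanity check that the broker accepts the new account and virtual host, a short Python snippet can open and close a connection. The pika client is used here purely for illustration and is not necessarily the AMQP library the pipeline uses internally; set the host to your head node's internal hostname.

# Quick check that RabbitMQ accepts the biouser account on the bionextgen vhost.
import pika

params = pika.ConnectionParameters(
    host="ip-10-125-10-182.ec2.internal",
    port=5672,
    virtual_host="bionextgen",
    credentials=pika.PlainCredentials("biouser", "tester"))
connection = pika.BlockingConnection(params)
print("Connected: %s" % connection.is_open)
connection.close()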

Run analysis

With messaging in place, we are ready to run the analysis. /export/data contains a ready-to-run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:

 $ cd /export/data/work
 $ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/fastq
                                   /export/data/exome_example/config/run_info.yaml
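
If the pipeline fails to start, a useful first check is that the two YAML files named in the command parse cleanly. This sketch only confirms they load with PyYAML; the meaning of the individual keys is defined by the pipeline's configuration schema.

# Load and print both configuration files to confirm they are valid YAML.
import yaml

for fname in ["/export/data/galaxy/post_process.yaml",
              "/export/data/exome_example/config/run_info.yaml"]:
    with open(fname) as in_handle:
        print(fname)
        print(yaml.safe_load(in_handle))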

distributed_nextgen_pipeline.py starts processing servers on each of the cluster nodes, using SGE for scheduling. A top level analysis server then splits the FASTQ data across the nodes at each step of the process (a simplified sketch of this fan-out follows the list):

  • Alignment with BWA
  • Preparation of merged alignment files with Picard
  • Recalibration and realignment with GATK
  • Variant calling with GATK
  • Assessment of predicted variant effects with snpEff
  • Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard
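
The sketch below illustrates the fan-out idea in a simplified, single-machine form: each lane's FASTQ file is aligned with BWA in its own process. It is not the pipeline's actual code; in the real setup these steps run on SGE-scheduled servers that receive work over RabbitMQ, and the reference index path and FASTQ filenames here are placeholders.

# Simplified fan-out: align each lane's FASTQ file with BWA in a separate
# process. The real pipeline distributes this work to cluster nodes instead.
import subprocess
from multiprocessing import Pool

REF = "/path/to/hg19-bwa-index/hg19.fa"  # placeholder BWA index location

def align_lane(fastq):
    sai = fastq.replace(".fastq", ".sai")
    sam = fastq.replace(".fastq", ".sam")
    with open(sai, "w") as out_handle:
        subprocess.check_call(["bwa", "aln", REF, fastq], stdout=out_handle)
    with open(sam, "w") as out_handle:
        subprocess.check_call(["bwa", "samse", REF, sai, fastq], stdout=out_handle)
    return sam

if __name__ == "__main__":
    fastqs = ["7_100326_FC6107FAAXX.fastq", "8_100326_FC6107FAAXX.fastq"]
    print(Pool(len(fastqs)).map(align_lane, fastqs))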

Monitor the running process

The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGE’s qstat command will tell you where the analysis servers are running on the cluster:

$ qstat
job-ID  prior   name   user  state submit/start at   queue
----------------------------------------------------------------------------------
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47 all.q@ip-10-125-10-182.ec2.int

Listing files in the working directory will show our progress:

$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp

The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:

$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK

Retrieve results

The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:

$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml
                      /export/data/exome_example/fastq
                      /export/data/work
                      /export/data/exome_example/config/run_info.yaml

For each sample, this script copies:

  • A BAM file with aligned sequences and original FASTQ data.
  • A realigned and recalibrated BAM file, ready for variant calling.
  • Variant calls in VCF format.
  • A tab delimited file of predicted variant effects.
  • A PDF summary file containing alignment, variant calling and hybrid selection statistics.

into an output directory for the flowcell, /export/data/galaxy/storage/100326_FC6107FAAXX:

$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pdf

As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.
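
For a quick look at the copied results outside of Galaxy, the tab delimited effects file can also be inspected directly from Python. The column layout depends on the snpEff version used, so this sketch only prints the first line (typically the header) and a record count.

# Peek at the tab delimited variant effects file copied for lane 7.
import csv

fname = ("/export/data/galaxy/storage/100326_FC6107FAAXX/7/"
         "7_100326_FC6107FAAXX-snp-effects.tsv")
with open(fname) as in_handle:
    reader = csv.reader(in_handle, delimiter="\t")
    print(next(reader))
    print("%d variant effect records" % sum(1 for _ in reader))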

Learn more

All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.

The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.

If you’d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I’ll be discussing it at the AWS Genomics Event in Seattle on September 22nd.

Written by Brad Chapman

August 19, 2011 at 5:33 pm

19 Responses


  1. Thanks so much for the great walkthrough — I’m really looking forward to testing this out. Being new to AWS, I was wondering if you could say a bit more about what costs to expect for this type of application. It’s fairly clear how much the various instances cost from the Amazon EC2 pricing page, but do you always keep an instance running Galaxy or do you just start an instance to align a FASTQ file and then stop it? What about upload & download of FASTQ and output files — are there significant costs associated with that? And how do you handle long-term storage? Do you use S3 (which seems fairly expensive) or just store locally?

    Shaun

    August 20, 2011 at 10:00 pm

    • Shaun;
      Thanks much. In terms of cost, the three major ones are the instances, storage and transfer. I run instances as needed for the processing, and then stop them when finished; this is by far the most cost effective way unless you need a publicly available server. To persist data between runs while in progress, I keep it on EBS stores; those are 10 cents per GB per month. CloudMan allows you to partition a separate EBS store and expand it as needed, which makes this easy. This allows you to terminate your clusters at will, saving on instance costs, and pick up right where you left off. Transfer is 10 cents per GB out and free going in. Pulling out just the files you need, as I do in the ‘Retrieve results’ section, allows you to identify what you actually need long term and transfer that only when finished processing.

      For longer term storage that needs to be Amazon accessible, I use S3. Since I often have a local copy, reduced redundancy storage (http://aws.amazon.com/s3/#protecting) is a good option. If you are finished with something and just need to keep it backed up, transferring back to local hardware works well. A machine with a decent amount of disk is not too expensive; where Amazon provides a major advantage is in on-demand compute and scale. Titus Brown has a nice blog post about NGS computational resources which provides some numbers for compute and storage: http://ivory.idyll.org/blog/jul-11/how-much-compute-ngs.html

      My best recommendation is to try running some sample work and keep close track of the costs as you go. The billing gives you a detailed statement with broken down costs which makes it very easy to identify what the most expensive parts are and think of ways to minimize them. The major “oops” mistake people make early on is to accidentally leave instances running; as long as you avoid this by checking in the console when finishing processing you won’t be overwhelmed with a huge bill. Thanks again and hope this helps.

      Brad Chapman

      August 21, 2011 at 5:09 pm

      • Thanks for the reply Brad – very helpful.

        Shaun

        August 21, 2011 at 5:27 pm

  2. Good work, Brad! Are you using this on EC2? I would have thought there would be privacy concerns over medical data.

    Jeremy Leipzig

    August 23, 2011 at 10:31 am

    • Thanks Jeremy. All of the write up is for EC2, although the underlying distributed exome pipeline works on standard cluster systems and single multicore machines as well. We are using it on a more traditional NFS managed cluster. The big advantage of EC2 is that you can give the exact setup and data to reproduce and build off of; it’s frustrating to start a set of instructions with: “first, buy a cluster”.

      We use both a local cluster and EC2 depending on the project data. Personally, I feel like many of the security concerns of EC2 versus local machines are overstated but I am not a lawyer so work wherever best suits the data and collaborators.

      Brad Chapman

      August 23, 2011 at 3:13 pm

  3. Brad, how much sequence was involved in this analysis? I’m guessing ~19 Mbases…

    Gregg TeHennepe

    August 23, 2011 at 12:40 pm

  4. Gregg;
    This is a set of demonstration data designed to run relatively quickly and not take up a lot of space, so it is a subset of reads from chromosome 22. It’s two replicate samples of ~300,000 76bp reads.

    Brad Chapman

    August 23, 2011 at 3:16 pm

  5. Great post (really). I have a Q regarding your experience running this on Amazon: do you experience temporary loss of connectivity when running intensive processes? Possibly specific to processes that involve a lot of I/O? I certainly do, though with my own analytical set-up, and am wondering how widespread this may be…

    Yannick Pouliot

    September 11, 2011 at 8:27 pm

  6. Thx Brad! I’m using a High-Memory Extra Large instance, and the phenomenon is reproducible. Any idea what instance types are best for short-read analysis? Would be an interesting survey to run…

    Btw, the I/O performance of HMEL instances is described as “moderate”, and I’m starting to think that this is where the issue lies, as the Large instance has “high” performance I/O…

    Yannick Pouliot

    September 12, 2011 at 1:00 pm

    • Yannick;
      I’d be very interested to hear what you come up with if you dig into the image differences more. I haven’t done any benchmarking myself, so it would be cool to see.

      Brad Chapman

      September 12, 2011 at 8:47 pm

  7. Hi Brad

    Very nice tutorial. So CloudMan is the system that brings the cluster on Amazon to life, and RabbitMQ is the messaging system used to send messages across to cluster nodes to do their work for alignment, etc.
    I’m quite used to the whole qsub paradigm and I assume that if we went up to the part before “Run Analysis” with Cloudman, it would then be a case of just qsubbing shell scripts to the SGE cluster.
    Is that correct?

    zayedi

    December 3, 2011 at 12:45 am

    • Zayed;
      That’s exactly right. CloudMan configures a standard SGE cluster with an NFS shared filesystem, and then gives you a web interface to add and remove nodes as needed. If you ssh into the instance, you can run jobs with qsub as you normally would on any other cluster.

      The analysis script for this pipeline interacts with SGE in exactly this way. It starts worker jobs with qsub and then sends out messages for processing to them with RabbitMQ.

      Brad Chapman

      December 4, 2011 at 2:33 pm

  8. […] blue collar bioinformatics’ where he describes the progress being made in online accessible cloud instances (private to you) of analytical environments for bioinformatics. Look for BioCloud Central where […]

  9. […] claim as my own work. After a few hours’ research, some early contenders were Bio-Linux and CloudBioLinux, but I was scared off by their size. I wanted something that most anybody could run with minimal […]

  10. Hi Brad,

    Do you know if anyone has tried to deploy bcbio on the Microsoft Azure Cloud?

    Matt

    Matt Bawn

    July 18, 2014 at 12:50 pm

    • Matt;
      We did some initial work on this a year or so ago, using their linux machines. It runs fine on single multicore instances. We didn’t do additional work on parallelizing because at the time there weren’t many options available for managing a cluster of machines. The situation has probably changed since then but we haven’t reinvestigated recently. Hope this helps some.

      Brad Chapman

      July 19, 2014 at 6:32 am

