Distributed exome analysis pipeline with CloudBioLinux and CloudMan
A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.
Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large-scale systems. Two open-source efforts that aim to provide this type of architecture are:
CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.
CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.
Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:
- The underlying analysis software is from CloudBioLinux.
- CloudMan provides an SGE cluster managed via a web front end.
- RabbitMQ is used for communication between cluster nodes.
- An automated pipeline, written in Python, organizes parallel processing across the cluster.
Start cluster with CloudBioLinux and CloudMan
Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:
Once this is booted, proceed to the CloudMan web interface on the server and start up an instance from this shared identifier:
This screencast shows all of the details, including starting an additional node on the SGE cluster:
Configure AMQP messaging
Edit: The AMQP messaging steps have now been fully automated, so the configuration steps in this section are no longer required. Skip down to the ‘Run Analysis’ section to start processing the data immediately.
With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:
$ ssh -i your-keypair firstname.lastname@example.org
Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
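To check the edit before moving on, the section can be read back with Python's standard configparser module. This is a minimal sketch, not part of the pipeline itself: the function name read_amqp_settings is hypothetical, and the sample text simply repeats the section shown above rather than reading the real file.

```python
import configparser

# Sample text repeating the [galaxy_amqp] section from the post.
SAMPLE_INI = """\
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
"""


def read_amqp_settings(ini_text):
    """Return the [galaxy_amqp] settings as a dict, with the port as an int.

    Hypothetical helper for illustration only.
    """
    # Interpolation is disabled so any '%' characters elsewhere in a larger
    # ini file are read literally instead of raising an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["galaxy_amqp"]
    return {
        "host": section["host"],
        "port": section.getint("port"),
        "userid": section["userid"],
        "password": section["password"],
    }


settings = read_amqp_settings(SAMPLE_INI)
print(settings["host"], settings["port"])
```

To check the real file, pass its contents instead of SAMPLE_INI, for example open("/export/data/galaxy/universe_wsgi.ini").read().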
Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:
$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...
...done.
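If you rebuild clusters often, the same three rabbitmqctl calls can be scripted. The sketch below is one possible way to do that, under the assumption that you want a dry run by default; the helper names rabbitmq_setup_commands and run_commands are hypothetical and not part of CloudMan or the pipeline.

```python
import shlex
import subprocess


def rabbitmq_setup_commands(user, password, vhost):
    """Build the three rabbitmqctl invocations as argument lists."""
    return [
        ["sudo", "rabbitmqctl", "add_user", user, password],
        ["sudo", "rabbitmqctl", "add_vhost", vhost],
        # Grant configure, write and read permissions on everything in the vhost.
        ["sudo", "rabbitmqctl", "set_permissions", "-p", vhost, user,
         ".*", ".*", ".*"],
    ]


def run_commands(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)


commands = rabbitmq_setup_commands("biouser", "tester", "bionextgen")
run_commands(commands)  # dry run: prints the commands without executing them
```

Passing dry_run=False on the master node would execute the commands for real.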
Run analysis
With messaging in place, we are ready to run the analysis. /export/data contains a ready-to-run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:
$ cd /export/data/work
$ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml /export/data/exome_example/fastq /export/data/exome_example/config/run_info.yaml
distributed_nextgen_pipeline.py starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:
- Alignment with BWA
- Preparation of merged alignment files with Picard
- Recalibration and realignment with GATK
- Variant calling with GATK
- Assessment of predicted variant effects with snpEff
- Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard
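The general shape of this staged, parallel processing can be sketched in a few lines of Python. This is only an illustration of the pattern, not the pipeline's code: threads stand in for cluster nodes, no RabbitMQ or SGE is involved, the stage names and the run_stage placeholder are hypothetical, and waiting for every sample to finish a stage before starting the next is an assumption made to keep the sketch simple.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage names mirroring the steps listed above.
STAGES = [
    "align",
    "merge",
    "recalibrate",
    "call_variants",
    "annotate_effects",
    "summarize",
]


def run_stage(stage, history):
    """Placeholder for real work: record that a stage ran for a sample."""
    return history + [stage]


def run_pipeline(sample_names, workers=2):
    """Run every stage over all samples, processing samples in parallel."""
    histories = [[name] for name in sample_names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in STAGES:
            # list() waits for all samples to finish this stage
            # before the loop moves on to the next one.
            histories = list(
                pool.map(lambda h, s=stage: run_stage(s, h), histories)
            )
    return histories


results = run_pipeline(["lane7", "lane8"])
for history in results:
    print(" -> ".join(history))
```

In the real system the worker pool is replaced by analysis servers on each node, with the top level server handing out work through messages.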
Monitor the running process
The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGE's qstat command will tell you where the analysis servers are running on the cluster:
$ qstat
job-ID  prior    name        user    state  submit/start at      queue
----------------------------------------------------------------------------------
1       0.55500  nextgen_an  ubuntu  r      08/14/2011 18:16:32  email@example.com
2       0.55500  nextgen_an  ubuntu  r      08/14/2011 18:16:32  firstname.lastname@example.org
3       0.55500  automated_  ubuntu  r      08/14/2011 18:16:47  email@example.com
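For scripted monitoring, output like this can be turned into records with a small parser. The sketch below assumes the column layout shown above (job ID, priority, name, user, state, date, time, queue); the function name parse_qstat is hypothetical, and the sample queue values are copied exactly as they appear in this post.

```python
# Sample text copied from the qstat output shown in the post.
SAMPLE_QSTAT = """\
job-ID  prior    name        user    state  submit/start at      queue
----------------------------------------------------------------------------------
1       0.55500  nextgen_an  ubuntu  r      08/14/2011 18:16:32  email@example.com
2       0.55500  nextgen_an  ubuntu  r      08/14/2011 18:16:32  firstname.lastname@example.org
3       0.55500  automated_  ubuntu  r      08/14/2011 18:16:47  email@example.com
"""


def parse_qstat(text):
    """Return one dict per job row, skipping the header and separator lines."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; other lines do not.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        # Queue instances look like queue@host; jobs that are still waiting
        # have no queue assigned, so the field may be absent.
        queue = fields[7] if len(fields) > 7 and "@" in fields[7] else None
        jobs.append({
            "job_id": int(fields[0]),
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
            "queue": queue,
        })
    return jobs


jobs = parse_qstat(SAMPLE_QSTAT)
for job in jobs:
    print(job["job_id"], job["name"], job["state"], job["queue"])
```

On the cluster, the text could come from subprocess.run(["qstat"], capture_output=True, text=True).stdout instead of the sample string.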
Listing files in the working directory will show our progress:
$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp
The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:
$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane 8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
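Rather than paging through each log by hand, a short script can report the most recent message from every server. This is a rough sketch that assumes log lines follow the "LEVEL: logger: message" layout seen above; the helper names latest_message and latest_by_file are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "INFO: nextgen_pipeline: Aligning lane ...".
LOG_LINE = re.compile(r"^(?P<level>[A-Z]+): (?P<logger>[\w.]+): (?P<message>.*)$")

# Sample text copied from the log excerpt shown in the post.
SAMPLE_LOG = """\
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane 8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
"""


def latest_message(log_text):
    """Return the message of the last matching log line, or None."""
    latest = None
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            latest = match.group("message")
    return latest


def latest_by_file(directory, pattern="*.o*"):
    """Map each matching log file name to its most recent message."""
    return {
        path.name: latest_message(path.read_text())
        for path in sorted(Path(directory).glob(pattern))
        if path.is_file()
    }


latest = latest_message(SAMPLE_LOG)
print(latest)
```

Calling latest_by_file("/export/data/work") on the head node would give a one-line status for each analysis server.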
The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:
$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml /export/data/exome_example/fastq /export/data/work /export/data/exome_example/config/run_info.yaml
For each sample, this script copies:
- A BAM file with aligned sequences and original FASTQ data
- A realigned and recalibrated BAM file, ready for variant calling
- Variant calls in VCF format
- A tab delimited file of predicted variant effects
- A PDF summary file containing alignment, variant calling and hybrid selection statistics
into an output directory for the flowcell:
$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd
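Before deleting the intermediate files, it is worth confirming that every expected output made it into the storage directory. The sketch below is one way to check, under stated assumptions: the suffix list is inferred from the listing above, the "-summary.pdf" extension is assumed from the PDF summary described earlier, and the helper name missing_outputs is hypothetical.

```python
# Suffixes inferred from the example listing; "-summary.pdf" is an
# assumption based on the PDF summary file described in the post.
EXPECTED_SUFFIXES = [
    ".bam",
    "-gatkrecal.bam",
    "-snp-filter.vcf",
    "-snp-effects.tsv",
    "-summary.pdf",
]


def missing_outputs(prefix, filenames):
    """Return the expected output names that are absent from filenames."""
    present = set(filenames)
    return [
        prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if prefix + suffix not in present
    ]


prefix = "7_100326_FC6107FAAXX"
complete = [prefix + suffix for suffix in EXPECTED_SUFFIXES]
# Simulate a run where the variant calls were never copied.
incomplete = [name for name in complete if not name.endswith(".vcf")]

print(missing_outputs(prefix, complete))
print(missing_outputs(prefix, incomplete))
```

In practice the file names would come from os.listdir on the sample's output directory.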
As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.
All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.
The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.
If you’d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I’ll be discussing it at the AWS Genomics Event in Seattle on September 22nd.