Blue Collar Bioinformatics

Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker


Overview

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using Docker. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without requiring a local installation to set up and maintain.

The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses Elasticluster to provision and configure the cluster. We automate the process with the boto Python interface to AWS and Ansible scripts. bcbio-vm isolates code and tools inside a Docker container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data.

We provide timing benchmarks for running a full variant calling analysis using bcbio on AWS. The benchmark dataset was a cancer tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from the Sequencing Quality Control (SEQC) project.

We developed bcbio on AWS and ran these timing benchmarks thanks to work with great partners. A collaboration with Biogen and Rudy Tanzi’s group at MGH funded the development of bcbio on AWS. A second collaboration with Intel Health and Life Sciences and AstraZeneca funded the somatic variant calling benchmarking work. We’re thankful for all the relationships that make this work possible:

  • John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front line testing and improvements to all the components of the bcbio AWS interaction.
  • Kristina Kermanshahche and Robert Read at Intel provided great support helping us get the Lustre ICEL CloudFormation templates running.
  • Ronen Artzi, Michael Heimlich, and Justin Johnson at AstraZeneca set up Lustre, Gluster and NFS benchmarks using a bcbio StarCluster instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS.
  • Jason Tetrault, Karl Gutwin and Hank Wu at Biogen provided valuable feedback, suggestions and resources for developing bcbio on AWS.
  • Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings.
  • James Cuff, Paul Edmon and the team at Harvard FAS research computing built and administered the Regal Lustre setup used for local testing.
  • John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation. Community feedback and contributions are essential to bcbio development.

Architecture

The implementation provides both a practical way to run large scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production quality runs. This writeup might feel a bit like a black triangle moment since I also wrote about running bcbio on AWS three years ago. That implementation was a demonstration for small scale usage rather than a production ready system. We now have a setup we can support and run on large scale projects thanks to numerous changes in the backend architecture:

  • Amazon, and cloud based providers in general, now provide high end filesystems and networking. Our AWS runs are fast because they use SSD backend storage, fast networking connectivity and high end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The gap between local cluster hardware and cloud hardware will continue to widen with upcoming improvements in compute (Haswell processors) and storage (16Tb EBS SSD volumes).
  • Isolating all of the software and code inside Docker containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system specific issues. From a researcher’s perspective, this will allow use of bcbio without needing to spend time installing and testing locally.
  • The setup runs from standard Linux base images using Ansible scripts and Elasticluster. This means we no longer need to build and update AMIs for changes in the architecture or code. This simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container specific management services like Amazon’s EC2 container service.
  • All long term data storage happens in Amazon’s S3 object store, including both analysis specific data as well as general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. On the analysis side, you maintain only the input files and high value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage storage of EBS volumes also provides a cost savings ($0.03/Gb/month for S3 versus $0.10+/Gb/month for EBS) and allows the option of archiving in Glacier for long term storage.

All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher friendly interface to setting up and running analyses. We hope to achieve that through the in-development Common Workflow Language from Galaxy, Arvados, Seven Bridges, Taverna and the open bioinformatics community.

Variant calling – benchmarking AWS versus local

We benchmarked somatic variant calling in two environments: on the Elasticluster/Docker AWS implementation and on local Harvard FAS machines.

  • AWS processing was twice as fast as a local run. The gains occur in disk IO intensive steps like alignment post-processing. AWS offers the opportunity to rent SSD backed storage and obtain a 10GigE connected cluster without contention for network resources. Our local test machines have an in-production Lustre filesystem attached to a large highly utilized cluster provided by Harvard FAS research computing.
  • At this scale Lustre and NFS have similar throughput, with Lustre outperforming NFS during IO intensive steps like alignment, post-processing and large BAM file merging. From previous benchmarking work we’ll need to process additional samples in parallel to fully stress the shared filesystem and differentiate Lustre versus NFS performance. However, the resource plots at this scale show potential future bottlenecks during alignment, post-processing and other IO intensive steps. Generally, having Lustre scaled across 4 LUNs per object storage server (OSS) enables better distribution of disk and network resources.

AWS runs use two c3.8xlarge instances clustered in a single placement group, providing 32 cores and 60Gb of memory per machine. Our local run was comparable with two compute machines, each with 32 cores and 128Gb of memory, connected to a Lustre filesystem. The benchmark is a cancer tumor/normal evaluation consisting of alignment, recalibration, realignment and variant detection with four different callers. The input is a tumor/normal pair from the ICGC-TCGA DREAM challenge with 100x coverage in exome regions. Here are the times, in hours, for each benchmark:

                           AWS (Lustre)   AWS (NFS)   Local (Lustre)
Total                      4:42           5:05        10:30
genome data preparation    0:04           0:10        -
alignment preparation      0:12           0:15        -
alignment                  0:29           0:52        0:53
callable regions           0:44           0:44        1:25
alignment post-processing  0:13           0:21        4:36
variant calling            2:35           2:05        2:36
variant post-processing    0:05           0:03        0:22
prepped BAM merging        0:03           0:18        0:06
validation                 0:05           0:05        0:09
population database        0:06           0:07        0:09

To provide more insight into the timing differences between these benchmarks, we automated collection and plotting of resource usage on AWS runs.

Variant calling – resource usage plots

bcbio retrieves collectl usage statistics from the server and prepares graphs of CPU, memory, disk and network usage. These plots allow in-depth insight into limiting factors during individual steps in the workflow.
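As a minimal sketch of the idea behind these graphs, the summarization step boils down to averaging per-interval metrics from collectl's exported data. The column names and CSV layout below are assumptions for illustration only; bcbio's real parser handles the full collectl raw format and many more metrics:

```python
import csv
import io

# Assumed collectl-style export with timestamp, CPU user % and CPU IO-wait %
# columns; real collectl output has a different, richer format.
SAMPLE = """\
time,cpu_user,cpu_wait
09:00:00,85,5
09:00:10,90,3
09:00:20,40,55
09:00:30,45,50
"""

def summarize_cpu(handle):
    """Return mean user and IO-wait CPU percentages across all samples."""
    rows = list(csv.DictReader(handle))
    n = len(rows)
    mean_user = sum(float(r["cpu_user"]) for r in rows) / n
    mean_wait = sum(float(r["cpu_wait"]) for r in rows) / n
    return mean_user, mean_wait

user, wait = summarize_cpu(io.StringIO(SAMPLE))
print("mean user %.1f%%  mean IO wait %.1f%%" % (user, wait))
```

A sustained high IO-wait average like the one in the second half of this toy trace is exactly the signature flagged in the NFS comparisons below.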

We’ll highlight some interesting comparisons between NFS and Lustre from the variant calling benchmarking. During this benchmark the two critical resources were CPU usage and disk IO on the shared filesystems. We also measured memory usage, but it was not a limiting factor in these analyses. In addition to the comparisons highlighted below, the full set of resource usage graphs is available for each run.

CPU

These plots compare CPU usage during processing steps for Lustre and NFS. The largest differences between the two runs are in the alignment, alignment post-processing and variant calling steps:

[Plots: CPU usage for the NFS and Lustre runs; shown figure: CPU resource usage for Lustre during variant calling]

For alignment and alignment post-processing the Lustre runs show more stable CPU usage. NFS specifically spends more time in the CPU wait state (red line) during IO intensive steps. On larger scale projects this may become a limiting factor in processing throughput. The variant calling step was slower on Lustre than NFS, with inconsistent CPU usage. We’ll have to investigate this slowdown further, since no other metrics point to an obvious bottleneck.

Shared filesystem network usage and IO

These plots compare network usage during processing for Lustre and NFS. We use this as a consistent proxy for the performance of the shared filesystem and disk IO (the NFS plots do have directly measured disk IO for comparison purposes).

[Plots: shared filesystem network usage for the NFS and Lustre runs; shown figure: network resource usage for Lustre]

The biggest difference in the IO intensive steps is that Lustre network usage is smoother compared to the spiky NFS input/output, due to spreading out read/writes over multiple disks. Including more processes with additional read/writes will help determine how these differences translate to scaling on larger numbers of simultaneous samples.

RNA-seq benchmarking

We also ran an RNA-seq analysis using 4 samples from the Sequencing Quality Control (SEQC) project. Each sample has 15 million 100bp paired reads. bcbio handled trimming, alignment with STAR, and quantitation with DEXSeq and Cufflinks. We ran on a single AWS c3.8xlarge machine with 32 cores, 60Gb of memory, and attached SSD storage.

RNA-seq optimization in bcbio is at an earlier stage than variant calling. We’ve done work to speed up trimming and alignment, but haven’t yet optimized the expression and count steps. The analysis completes quickly, in 6 1/2 hours, but there is still room for further optimization, and this is a nice example of how we can use benchmarking plots to identify targets for additional work:

Total                    6:25
genome data preparation  0:32
adapter trimming         0:32
alignment                0:24
estimate expression      3:41
quality control          1:16

The RNA-seq collectl plots show the cause of the slower steps during expression estimation and quality control. Here is CPU usage over the run:


[Figure: RNA-seq CPU usage]

The low CPU usage during the first 2 hours of expression estimation corresponds to DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample. Similarly, the QC steps could benefit from parallel processing. Alternatively, we’re looking at validating other approaches to quantification like eXpress. These are the types of benchmarking and validation steps that are continually ongoing in the development of bcbio pipelines.
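A sketch of that per-sample parallelization with multiprocessing might look like the following. The run_dexseq helper and sample names are placeholders; the real step would invoke the DEXSeq counting script for each sample:

```python
import multiprocessing

def run_dexseq(sample):
    """Placeholder standing in for a single-core DEXSeq counting job on one
    sample; a real implementation would shell out to the DEXSeq script and
    return the path of the counts file it produced."""
    return (sample, "%s.counts" % sample)

def count_in_parallel(samples):
    """Launch one single-core job per sample simultaneously instead of
    running the samples serially."""
    with multiprocessing.Pool(processes=len(samples)) as pool:
        return pool.map(run_dexseq, samples)

if __name__ == "__main__":
    # Hypothetical names for the 4 SEQC inputs.
    print(count_in_parallel(["SEQC-1", "SEQC-2", "SEQC-3", "SEQC-4"]))
```

With 4 samples and 32 cores this leaves most of the machine idle, so in practice these single-core jobs would be interleaved with other steps.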

Reproducing the analysis

The process to launch the cluster and an NFS or optional Lustre shared filesystem is fully automated and documented. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. bcbio_vm.py has commands to:

  • Add an IAM user, a VPC and create the Elasticluster config.
  • Launch a cluster and bootstrap with the latest bcbio code and data.
  • Create and mount a Lustre filesystem attached to the cluster.
  • Terminate the cluster and Lustre stack upon completion.

The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using IAM instance profiles. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation.
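As an illustration of how the S3 handoff works, inputs and outputs are referenced as plain s3:// URLs that get split into a bucket and key before calling the AWS API. The helpers below are a hypothetical sketch using the boto interface mentioned above, not bcbio's actual code:

```python
def parse_s3_url(url):
    """Split an s3://bucket/key/path URL into (bucket, key)."""
    if not url.startswith("s3://"):
        raise ValueError("not an S3 URL: %s" % url)
    bucket, _, key = url[len("s3://"):].partition("/")
    return bucket, key

def download_input(url, local_path):
    """Hypothetical sketch: stream an input file down from S3 before
    processing, using the classic boto S3 API."""
    import boto  # the implementation uses the boto interface to AWS
    bucket_name, key_name = parse_s3_url(url)
    bucket = boto.connect_s3().get_bucket(bucket_name)
    bucket.get_key(key_name).get_contents_to_filename(local_path)

bucket, key = parse_s3_url("s3://YOURBUCKET-syn3-eval/cancer-dream-syn3-aws.yaml")
print(bucket, key)
```

Credentials never appear in this path: with IAM instance profiles the cluster nodes pick up temporary access keys automatically.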

Running the somatic variant calling evaluation

This analysis evaluates variant calling on a tumor/normal somatic sample from the DREAM challenge. To run it, prepare an S3 bucket for the analysis: copy the configuration file to your own personal bucket and add a GATK jar. You can use the AWS console or any available S3 client to do this. For example, using the AWS command line client:

aws s3 mb s3://YOURBUCKET-syn3-eval/
aws s3 cp s3://bcbio-syn3-eval/cancer-dream-syn3-aws.yaml s3://YOURBUCKET-syn3-eval/
aws s3 cp GenomeAnalysisTK.jar s3://YOURBUCKET-syn3-eval/jars/

Now ssh to the cluster head node, create the work directory and use bcbio_vm to create a batch script that we submit to SLURM. This example uses an attached Lustre filesystem:

bcbio_vm.py elasticluster ssh bcbio
sudo mkdir -p /scratch/cancer-dream-syn3-exome
sudo chown ubuntu !$
cd !$ && mkdir work && cd work
bcbio_vm.py ipythonprep s3://YOURBUCKET-syn3-eval/cancer-dream-syn3-aws.yaml \
                        slurm cloud -n 60
sbatch bcbio_submit.sh

This runs alignment and variant calling with multiple callers (MuTect, FreeBayes, VarDict and VarScan), validates against the DREAM validation dataset truth calls and uploads the results back to S3 in YOURBUCKET-syn3-eval/final.

Running the RNA-seq evaluation

This example runs an RNA-seq analysis using inputs from the Sequencing Quality Control (SEQC) project. Full details on the analysis are available in the bcbio example run documentation. To setup the run, we copy the input configuration from a publicly available S3 bucket into your own personal bucket:

aws s3 mb s3://YOURBUCKET-eval-rna-seqc/
aws s3 cp s3://bcbio-eval-rna-seqc/eval-rna-seqc.yaml s3://YOURBUCKET-eval-rna-seqc/

Now ssh to the cluster head node, create the work directory and use bcbio_vm to run on 32 cores using the attached EBS SSD filesystem:

bcbio_vm.py elasticluster ssh bcbio
mkdir -p ~/run/eval-rna-seqc/work
cd ~/run/eval-rna-seqc/work
bcbio_vm.py run s3://YOURBUCKET-eval-rna-seqc/eval-rna-seqc.yaml -n 32

This will process three replicates from two different SEQC panels, performing adapter trimming and alignment with STAR, and producing counts, Cufflinks quantitation and quality control metrics. The results upload back into your initial S3 bucket as YOURBUCKET-eval-rna-seqc/final, and you can then shut down the cluster used for processing.

Costs per hour

These are the instance costs, per hour, for running a 2 node 64 core cluster and associated Lustre filesystem on AWS. For NFS runs, the compute costs are identical but there will not be any additional instances for the shared filesystem. Other run costs will include EBS volumes, but these are small ($0.13/Gb/month) compared to the instance costs over these time periods. We use S3 for long term storage rather than the Lustre or NFS filesystems.

                             AWS type     n   each    total
compute entry node           c3.large     1   $0.11
compute worker nodes         c3.8xlarge   2   $1.68
compute subtotal                                      $3.47/hr
OST (object storage target)  c3.2xlarge   4   $0.42
MDT (metadata target)        c3.4xlarge   1   $0.84
MGT (management target)      c3.xlarge    1   $0.21
NATDevice                    m3.medium    1   $0.07
Lustre licensing             -            1   $0.48
Lustre subtotal                                       $3.28/hr
total                                                 $6.75/hr
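The subtotals come straight from multiplying instance counts by hourly prices. A quick arithmetic check of the table (prices as listed above; current AWS pricing will differ):

```python
# Hourly cost check for the cluster configuration above (prices as listed
# in the table; AWS pricing changes over time and by region).
compute = [("c3.large", 1, 0.11), ("c3.8xlarge", 2, 1.68)]
lustre = [("c3.2xlarge", 4, 0.42), ("c3.4xlarge", 1, 0.84),
          ("c3.xlarge", 1, 0.21), ("m3.medium", 1, 0.07),
          ("Lustre licensing", 1, 0.48)]

def hourly(items):
    """Total hourly cost: sum of instance count times per-instance price."""
    return round(sum(n * each for _, n, each in items), 2)

print("compute $%.2f/hr  Lustre $%.2f/hr  total $%.2f/hr"
      % (hourly(compute), hourly(lustre), hourly(compute + lustre)))
```

At these rates the 4:42 Lustre benchmark run costs roughly $32 in instance time, which is the tear-down-when-done economics the architecture section describes.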

Work to do

The bcbio AWS implementation is freely available and documented, and we’ll continue to develop and support it, with several areas of immediate improvement we hope to tackle in the future.

We welcome feedback on the implementation, usage and benchmarking results.

Written by Brad Chapman

December 19, 2014 at 10:17 am

12 Responses


  1. Brad – fantastic work! Congrats to all involved!

    Jeff Layton

    December 19, 2014 at 10:56 am

  2. […] 1. Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker (link) […]

  3. Hey Brad,

    Nice work, but the numbers for @fasrc are ways off. We benched that storage at over 14GB/s and it’s only any use when you run 1000’s of jobs, this is a two node test. Least I now know what that large SLURM allocation we gave you was used for. You are not comparing apples to apples here, we are also not spinning SSD on local disk. Just seems a bit off to use us as your whipping boy – we are trying really hard to keep you and 5,000 of your friends happy :-)

    http://blog.jcuff.net/2013/07/of-benching-wicked-fast-storage-array.html

    James Cuff (@jamesdotcuff)

    December 20, 2014 at 11:06 am

    • James;
      Apologies. It was not my intention to criticize FAS or try and provide a benchmark of regal’s capabilities. In contrast, I thought it performed awesomely compared to the most expensive hardware you can rent at AWS. The point of the comparison was to show if you switch to SSDs you can remove a bottleneck at two IO intensive steps (callable regions and alignment post-processing). I included it because I thought it was a useful finding and wanted to give you all credit for the work you do supporting science at scale.

      Happy to adjust the post if it reads like a criticism of regal or FAS. It is a brilliant setup and we’ve done lots of awesome science there.

      Brad Chapman

      December 20, 2014 at 1:33 pm

    • Not seeing anyone being used as a whipping boy. If anything the cost involved to run something on the same scale as Regal (or otherwise) should have most users flinching, and that’s certainly one table I keep pointing FAS users at who expect thousands of CPU hours at zero cost — in the vague hope that they appreciate the work done at FAS, HMS and elsewhere.

      Oliver Hofmann

      December 20, 2014 at 4:13 pm

  4. […] A very cool post on the Blue Collar Bioinformatics Blog talked about Benchmarking Variation and RNA-seq Analyses on Amazon Web Services with Docker. […]

  5. Awesome post Brad, thanks for sharing!

    I was wondering if you could elaborate on a little detail: are you running the variant calling and somatic analysis on an SGE deployed by StarCluster?

    You mention that the variant calling was done on two c3.8xlarge, was that used both for Luster/NFS and compute or you are using a separate machine(s) for compute and others for shared storage.

    Also in terms of your load balancing for the sub-jobs, are you using the binning feature? if so, could you please elaborate on how many jobs you are breaking the analyses to (e.g. are you breaking the data on chromosomes, or how many aligner instances do you run) In general I am interested to know at what times there is a requirement for passing the data between the nodes through NFS/Luster (as opposed to running the jobs locally on the nodes using scratch disks).

    My understanding of bcbio is limited to reading your developer docs and I don’t have much hands on experience so apologies in advance if the answers are trivial :)

    • Amirhossein;
      Thanks for all the comments and questions. I’ll try to tackle them all in order:

      – This uses Elasticluster and SLURM for setting up the cluster and managing jobs, respectively. There are more details about the setup in the documentation (https://bcbio-nextgen.readthedocs.org/en/latest/contents/cloud.html). It is a similar setup to StarCluster but we preferred this since we don’t need to create custom AMIs.

      – The two c3.8xlarge machines are for compute only. NFS storage runs from the head node, which is a c3.large instance. Lustre requires multiple machines and all of the used resources are listed in the second section of the table under ‘Costs per hour’

      – The jobs get broken up during alignment and variant calling. For alignment it split into chunks of 5 million reads (https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/cancer-dream-syn3.yaml#L8) and this is configurable. For variant calling, we parallelize by genomic region trying to be more homogenous than splitting by chromosome since chromosome 1 dominates processing times. This older benchmarking post describes the approach under ‘Parallelism by genomic regions’ http://bcb.io/2013/05/22/scaling-variant-detection-pipelines-for-whole-genome-sequencing-analysis/

      – We do have the ability to use local scratch disk for writing, instead of using the shared filesystem directly. The files get written locally and then copied over to the final shared filesystem. We didn’t use that in this benchmark since we didn’t appear to be saturating the network filesystems. I’m absolutely agreed that it could be useful and would like to explore it in the future.

      Thanks again.

      Brad Chapman

      January 4, 2015 at 12:38 pm

  6. Hi,
    Very Nice Work!
    I would like to reproduce the NA12878 analysis so first of all I should keep the same raw data we all used.
    However, I found the ftp. you mentioned on the website is invalid(https://raw.github.com/chapmanb/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh).
    Could you give me the raw data (alignment Bam files or Fastq files) of NA12878?

    Weijian

    January 17, 2015 at 7:55 am

  7. Weijian;
    Glad this was useful and sorry you’re having access problems for getting the data. Could you provide more specific details about what URLs are invalid? The FTP sites in the shell script are from NCBI and EBI and are all working for me now. Is it possible one of the sites was temporarily down or inaccessible? I hope re-running it works cleanly for you.

    Brad Chapman

    January 17, 2015 at 1:58 pm

    • Hi Brad Chapman
      Thank you very much for your help! Now I can have access to it.
      I merely find the data of NA12878 you used so I am sorry that I don’t know if there is other URL invalid.

      Weijian

      January 18, 2015 at 8:22 pm

  8. […] Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker […]

