Blue Collar Bioinformatics


Posts Tagged ‘docker’

Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker


Overview

We developed a freely available, easy to run implementation of bcbio-nextgen on Amazon Web Services (AWS) using Docker. bcbio is a community developed tool providing validated and scalable variant calling and RNA-seq analysis. The AWS implementation automates all of the steps of building a cluster, attaching high performance shared filesystems, and running an analysis. This makes bcbio readily available to the research community without the need to install and configure a local installation.

The entire installation bootstraps from standard Linux AMIs, enabling adjustment of the tools, genome data and installation without needing to prepare custom AMIs. The implementation uses Elasticluster to provision and configure the cluster. We automate the process with the boto Python interface to AWS and Ansible scripts. bcbio-vm isolates code and tools inside a Docker container allowing runs on any remote machine with a download of the Docker image and access to the shared filesystem. Analyses run directly from S3 buckets, with automatic streaming download of input data and upload of final processed data.
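
To make the S3-driven data flow concrete, here is a minimal sketch of the download and upload pattern using the boto library. This is illustrative only rather than bcbio's actual code, and the bucket and file names are made-up examples:

import boto

# Credentials come from environment variables or the instance's IAM role.
conn = boto.connect_s3()
bucket = conn.get_bucket("YOURBUCKET-syn3-eval")

# Download an input configuration file from S3 into the work directory.
key = bucket.get_key("cancer-dream-syn3-aws.yaml")
key.get_contents_to_filename("cancer-dream-syn3-aws.yaml")

# ... run the analysis ...

# Upload a final output file back to S3, requesting server-side encryption.
out = bucket.new_key("final/variants.vcf.gz")
out.set_contents_from_filename("variants.vcf.gz", encrypt_key=True)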

We provide timing benchmarks for running a full variant calling analysis using bcbio on AWS. The benchmark dataset was a cancer tumor/normal evaluation, from the ICGC-TCGA DREAM challenge, with 100x coverage in exome regions. We compared the results of running this dataset on 2 different networked filesystems: Lustre and NFS. We also show benchmarks for an RNA-seq dataset using inputs from the Sequencing Quality Control (SEQC) project.

We developed bcbio on AWS and ran these timing benchmarks thanks to work with great partners. A collaboration with Biogen and Rudy Tanzi’s group at MGH funded the development of bcbio on AWS. A second collaboration with Intel Health and Life Sciences and AstraZeneca funded the somatic variant calling benchmarking work. We’re thankful for all the relationships that make this work possible:

  • John Morrissey automated the process of starting a bcbio cluster on AWS and attaching a Lustre filesystem. He also automated the approach to generating graphs of resource usage from collectl stats and provided critical front line testing and improvements to all the components of the bcbio AWS interaction.
  • Kristina Kermanshahche and Robert Read at Intel provided great support helping us get the Lustre ICEL CloudFormation templates running.
  • Ronen Artzi, Michael Heimlich, and Justin Johnson at AstraZeneca set up Lustre, Gluster and NFS benchmarks using a bcbio StarCluster instance. This initial validation was essential for convincing us of the value of moving to a shared filesystem on AWS.
  • Jason Tetrault, Karl Gutwin and Hank Wu at Biogen provided valuable feedback, suggestions and resources for developing bcbio on AWS.
  • Glen Otero parsed the collectl data and provided graphs, which gave us a detailed look into the potential causes of bottlenecks we found in the timings.
  • James Cuff, Paul Edmon and the team at Harvard FAS research computing built and administered the Regal Lustre setup used for local testing.
  • John Kern and other members of the bcbio community tested, debugged and helped identify issues with the implementation. Community feedback and contributions are essential to bcbio development.

Architecture

The implementation provides both a practical way to run large scale variant calling and RNA-seq analysis, as well as a flexible backend architecture suitable for production quality runs. This writeup might feel a bit like a black triangle moment since I also wrote about running bcbio on AWS three years ago. That implementation was a demonstration for small scale usage rather than a production ready system. We now have a setup we can support and run on large scale projects thanks to numerous changes in the backend architecture:

  • Amazon, and cloud based providers in general, now provide high end filesystems and networking. Our AWS runs are fast because they use SSD backend storage, fast networking connectivity and high end processors that would be difficult to invest in for a local cluster. Renting these is economically feasible now that we have an approach to provision resources, run the analysis, and tear everything down. The dichotomy between local cluster hardware and cloud hardware will continue to expand with upcoming improvements in compute (Haswell processors) and storage (16Tb EBS SSD volumes).
  • Isolating all of the software and code inside Docker containers enables rapid deployment of fixes and improvements. From an open source support perspective, Amazon provides a consistent cluster environment we have full control over, limiting the space of potential system specific issues. From a researcher’s perspective, this will allow use of bcbio without needing to spend time installing and testing locally.
  • The setup runs from standard Linux base images using Ansible scripts and Elasticluster. This means we no longer need to build and update AMIs for changes in the architecture or code. This simplifies testing and pushing fixes, letting us spend less time on support and more on development. Since we avoid having a pre-built AMI, the process of building and running bcbio on AWS is fully auditable for both security and scientific accuracy. Finally, it provides a path to support bcbio on container specific management services like Amazon’s EC2 container service.
  • All long term data storage happens in Amazon’s S3 object store, including both analysis specific data as well as general reference genome data. Downloading reference data for an analysis on demand removes the requirement to maintain large shared EBS volumes. On the analysis side, you maintain only the input files and high value output files in S3, removing the intermediates upon completion of the analysis. Removing the need to manage storage of EBS volumes also provides a cost savings ($0.03/Gb/month for S3 versus $0.10+/Gb/month for EBS) and allows the option of archiving in Glacier for long term storage. A rough worked cost comparison follows this list.
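
As an illustration of the storage cost difference using the per-gigabyte prices quoted above, here is a back-of-the-envelope comparison; the 500Gb of retained results and the time spans are made-up example numbers:

# Example storage cost comparison using the per-Gb prices quoted above.
# The 500Gb data size is an illustrative assumption, not a bcbio requirement.
size_gb = 500
s3_rate, ebs_rate = 0.03, 0.10   # $/Gb/month

for months in (1, 6, 12):
    s3_cost = size_gb * s3_rate * months
    ebs_cost = size_gb * ebs_rate * months
    print("%2d months: S3 $%5.0f  vs  EBS $%5.0f" % (months, s3_cost, ebs_cost))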

All of these architectural changes provide a setup that is easier to maintain and scale over time. Our goal moving ahead is to provide a researcher friendly interface to setting up and running analyses. We hope to achieve that through the in-development Common Workflow Language from Galaxy, Arvados, Seven Bridges, Taverna and the open bioinformatics community.

Variant calling – benchmarking AWS versus local

We benchmarked somatic variant calling in two environments: on the elasticluster Docker AWS implementation and on local Harvard FAS machines.

  • AWS processing was twice as fast as a local run. The gains occur in disk IO intensive steps like alignment post-processing. AWS offers the opportunity to rent SSD backed storage and obtain a 10GigE connected cluster without contention for network resources. Our local test machines have an in-production Lustre filesystem attached to a large highly utilized cluster provided by Harvard FAS research computing.
  • At this scale Lustre and NFS have similar throughput, with Lustre outperforming NFS during IO intensive steps like alignment, post-processing and large BAM file merging. From previous benchmarking work we’ll need to process additional samples in parallel to fully stress the shared filesystem and differentiate Lustre versus NFS performance. However, the resource plots at this scale show potential future bottlenecks during alignment, post-processing and other IO intensive steps. Generally, having Lustre scaled across 4 LUNs per object storage server (OSS) enables better distribution of disk and network resources.

AWS runs use two c3.8xlarge instances clustered in a single placement group, providing 32 cores and 60Gb of memory per machine. Our local run was comparable with two compute machines, each with 32 cores and 128Gb of memory, connected to a Lustre filesystem. The benchmark is a cancer tumor/normal evaluation consisting of alignment, recalibration, realignment and variant detection with four different callers. The input is a tumor/normal pair from the ICGC-TCGA DREAM challenge with 100x coverage in exome regions. Here are the times, in hours, for each benchmark:

                            AWS (Lustre)   AWS (NFS)   Local (Lustre)
  Total                     4:42           5:05        10:30
  genome data preparation   0:04           0:10        -
  alignment preparation     0:12           0:15        -
  alignment                 0:29           0:52        0:53
  callable regions          0:44           0:44        1:25
  alignment post-processing 0:13           0:21        4:36
  variant calling           2:35           2:05        2:36
  variant post-processing   0:05           0:03        0:22
  prepped BAM merging       0:03           0:18        0:06
  validation                0:05           0:05        0:09
  population database       0:06           0:07        0:09

To provide more insight into the timing differences between these benchmarks, we automated collection and plotting of resource usage on AWS runs.

Variant calling – resource usage plots

bcbio retrieves collectl usage statistics from the server and prepares graphs of CPU, memory, disk and network usage. These plots allow in-depth insight into limiting factors during individual steps in the workflow.
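
As a rough example of what that involves, the sketch below parses collectl plot-format (-P) output and graphs CPU usage with pandas and matplotlib. It is a simplified stand-in rather than bcbio's actual graphing code, and the file name and CPU column names are assumptions you would adapt to your own collectl output:

import gzip

import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def load_collectl(path):
    """Read collectl plot-format (-P) output into a DataFrame.

    Assumes a '#Date Time ...' header line followed by whitespace-separated
    samples; adjust for the collectl options used on your run.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = [line for line in handle if line.strip()]
    header_line = next(line for line in lines if line.startswith("#Date"))
    header = [h.lstrip("#") for h in header_line.split()]
    rows = [line.split() for line in lines
            if not line.startswith("#") and len(line.split()) == len(header)]
    return pd.DataFrame(rows, columns=header)

df = load_collectl("node1-cpu.tab.gz")                 # hypothetical file name
cpu_cols = ["[CPU]User%", "[CPU]Sys%", "[CPU]Wait%"]   # assumed column names
df[cpu_cols] = df[cpu_cols].astype(float)
df[cpu_cols].plot(title="CPU usage during the run")
plt.xlabel("collectl sample")
plt.ylabel("percent CPU")
plt.savefig("cpu_usage.png")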

We’ll highlight some interesting comparisons between NFS and Lustre during the variant calling benchmarking. During this benchmark, the two critical resources were CPU usage and disk IO on the shared filesystems. We also measure memory usage but that was not a limiting factor with these analyses. In addition to the comparisons highlighted below, we have the full set of resource usage graphs available for each run.

CPU

These plots compare CPU usage during processing steps for Lustre and NFS. The largest differences between the two runs are in the alignment, alignment post-processing and variant calling steps:

[Figure: CPU resource usage for NFS during variant calling]

[Figure: CPU resource usage for Lustre during variant calling]

For alignment and alignment post-processing the Lustre runs show more stable CPU usage. NFS specifically spends more time in the CPU wait state (red line) during IO intensive steps. On larger scale projects this may become a limiting factor in processing throughput. The variant calling step was slower on Lustre than NFS, with inconsistent CPU usage. We’ll have to investigate this slowdown further, since no other metrics point to an obvious bottleneck.

Shared filesystem network usage and IO

These plots compare network usage during processing for Lustre and NFS. We use this as a consistent proxy for the performance of the shared filesystem and disk IO (the NFS plots do have directly measured disk IO for comparison purposes).

[Figure: Network resource usage for NFS during variant calling]

[Figure: Network resource usage for Lustre during variant calling]

The biggest difference in the IO intensive steps is that Lustre network usage is smoother compared to the spiky NFS input/output, due to spreading out read/writes over multiple disks. Including more processes with additional read/writes will help determine how these differences translate to scaling on larger numbers of simultaneous samples.

RNA-seq benchmarking

We also ran an RNA-seq analysis using 4 samples from the Sequencing Quality Control (SEQC) project. Each sample has 15 million 100bp paired reads. bcbio handled trimming, alignment with STAR, and quantitation with DEXSeq and Cufflinks. We ran on a single AWS c3.8xlarge machine with 32 cores, 60Gb of memory, and attached SSD storage.

RNA-seq optimization in bcbio is at an earlier stage than variant calling. We’ve done work to speed up trimming and aligning, but haven’t yet optimized the expression and count steps. The analysis runs quickly in 6 1/2 hours, but there is still room for further optimization, and this is a nice example of how we can use benchmarking plots to identify targets for additional work:

Total 6:25
genome data preparation 0:32
adapter trimming 0:32
alignment 0:24
estimate expression 3:41
quality control 1:16

The RNA-seq collectl plots show the cause of the slower steps during expression estimation and quality control. Here is CPU usage over the run:


[Figure: RNA-seq CPU usage]

The low CPU usage during the first 2 hours of expression estimation corresponds to DEXSeq running serially over the 4 samples. In contrast with Cufflinks, which parallelizes over all 32 cores, DEXSeq runs on a single core. We could run these steps in parallel by using multiprocessing to launch the jobs, split by sample. Similarly, the QC steps could benefit from parallel processing. Alternatively, we’re looking at validating other approaches to quantification like eXpress. These are the types of benchmarking and validation steps that are continually ongoing in the development of bcbio pipelines.
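
Here is a minimal sketch of that per-sample parallelization idea. It is not bcbio's implementation: run_dexseq and the wrapper script it calls are hypothetical placeholders for the real single-core counting step.

import multiprocessing
import subprocess

def run_dexseq(sample):
    """Run the single-core counting step for one sample (placeholder command)."""
    out = "%s-dexseq.counts" % sample
    # dexseq_count_for_sample.sh is a hypothetical wrapper, not a bcbio script.
    subprocess.check_call(["dexseq_count_for_sample.sh", "%s.bam" % sample, out])
    return out

if __name__ == "__main__":
    samples = ["SEQC-1", "SEQC-2", "SEQC-3", "SEQC-4"]   # example sample names
    pool = multiprocessing.Pool(processes=min(len(samples), 32))
    try:
        # One single-core DEXSeq job per sample, run side by side.
        counts = pool.map(run_dexseq, samples)
    finally:
        pool.close()
        pool.join()
    print(counts)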

Reproducing the analysis

The process to launch the cluster and an NFS or optional Lustre shared filesystem is fully automated and documented. It sets up permissions, VPCs, clusters and shared filesystems from a basic AWS account, so requires minimal manual work. bcbio_vm.py has commands to:

  • Add an IAM user and a VPC, and create the Elasticluster config.
  • Launch a cluster and bootstrap with the latest bcbio code and data.
  • Create and mount a Lustre filesystem attached to the cluster.
  • Terminate the cluster and Lustre stack upon completion.

The processing handles download of input data from S3 and upload back to S3 on finalization. We store data encrypted on S3 and manage access using IAM instance profiles. The examples below show how to run both a somatic variant calling evaluation and an RNA-seq evaluation.

Running the somatic variant calling evaluation

This analysis performs an evaluation of variant calling using a tumor/normal somatic sample pair from the DREAM challenge. To run it, prepare an S3 bucket for the analysis. Copy the configuration file to your own personal bucket and add a GATK jar. You can use the AWS console or any available S3 client to do this. For example, using the AWS command line client:

aws s3 mb s3://YOURBUCKET-syn3-eval/
aws s3 cp s3://bcbio-syn3-eval/cancer-dream-syn3-aws.yaml s3://YOURBUCKET-syn3-eval/
aws s3 cp GenomeAnalysisTK.jar s3://YOURBUCKET-syn3-eval/jars/

Now ssh to the cluster head node, create the work directory and use bcbio_vm to create a batch script that we submit to SLURM. This example uses an attached Lustre filesystem:

# log into the cluster head node
bcbio_vm.py elasticluster ssh bcbio
# create the project directory on the attached shared filesystem (mounted at /scratch);
# !$ is bash shorthand for the last argument of the previous command
sudo mkdir -p /scratch/cancer-dream-syn3-exome
sudo chown ubuntu !$
cd !$ && mkdir work && cd work
# generate a SLURM batch script for a 60 core run and submit it to the scheduler
bcbio_vm.py ipythonprep s3://YOURBUCKET-syn3-eval/cancer-dream-syn3-aws.yaml \
                        slurm cloud -n 60
sbatch bcbio_submit.sh

This runs alignment and variant calling with multiple callers (MuTect, FreeBayes, VarDict and VarScan), validates against the DREAM validation dataset truth calls and uploads the results back to S3 in YOURBUCKET-syn3-eval/final.

Running the RNA-seq evaluation

This example runs an RNA-seq analysis using inputs from the Sequencing Quality Control (SEQC) project. Full details on the analysis are available in the bcbio example run documentation. To set up the run, copy the input configuration from a publicly available S3 bucket into your own personal bucket:

aws s3 mb s3://YOURBUCKET-eval-rna-seqc/
aws s3 cp s3://bcbio-eval-rna-seqc/eval-rna-seqc.yaml s3://YOURBUCKET-eval-rna-seqc/

Now ssh to the cluster head node, create the work directory and use bcbio_vm to run on 32 cores using the attached EBS SSD filesystem:

# log into the cluster head node
bcbio_vm.py elasticluster ssh bcbio
# create and move into the work directory
mkdir -p ~/run/eval-rna-seqc/work
cd ~/run/eval-rna-seqc/work
# run the analysis on 32 cores, pulling inputs from S3
bcbio_vm.py run s3://YOURBUCKET-eval-rna-seqc/eval-rna-seqc.yaml -n 32

This will process three replicates from two different SEQC panels, performing adapter trimming and alignment with STAR, and producing counts, Cufflinks quantitation and quality control metrics. The results upload back into your initial S3 bucket as YOURBUCKET-eval-rna-seqc/final, and you can then shut down the cluster used for processing.

Costs per hour

These are the instance costs, per hour, for running a 2 node 64 core cluster and associated Lustre filesystem on AWS. For NFS runs, the compute costs are identical but there will not be any additional instances for the shared filesystem. Other run costs will include EBS volumes, but these are small ($0.13/Gb/month) compared to the instance costs over these time periods. We use S3 for long term storage rather than the Lustre or NFS filesystems.

                            AWS type     n   each     total
  compute entry node        c3.large     1   $0.11
  compute worker nodes      c3.8xlarge   2   $1.68
  compute subtotal                                     $3.47/hr
  ost (object data store)   c3.2xlarge   4   $0.42
  mdt (metadata target)     c3.4xlarge   1   $0.84
  mgt (management target)   c3.xlarge    1   $0.21
  NATDevice                 m3.medium    1   $0.07
  Lustre licensing                       1   $0.48
  Lustre subtotal                                      $3.28/hr
  total                                                $6.75/hr

Work to do

The bcbio AWS implementation is freely available and documented and we’ll continue to develop and support it, with several areas of immediate improvement we hope to tackle in the future.

We welcome feedback on the implementation, usage and benchmarking results.

Written by Brad Chapman

December 19, 2014 at 10:17 am

Improving reproducibility and installation of genomic analysis pipelines with Docker


Motivation

bcbio-nextgen is a community developed, best-practice pipeline for genomic data processing, performing automated variant calling and RNA-seq analyses from high throughput sequencing data. It has an automated installation script that sets up the code and third party tools used during analysis, and we’ve been working on improving the process to make getting started with bcbio-nextgen easier. The current approach of installing tools in a separate semi-isolated directory is non-optimal for a couple of reasons:

  • A separate directory does not give full isolation from system programs and libraries. It’s possible to disrupt processing by unintentionally including other command line programs in your PATH. Additionally, it is not easy to recreate a snapshot of the current environment for reproducibility without manual re-installation of specific versions of software.
  • The automated installation script needs to deal with the peculiarities of heterogeneous cluster environments. Different system characteristics can be tricky to anticipate and automate, and lead to more tickets devoted to install problems than we’d like. The goal is to do more science and spend less time dealing with installation woes.

Docker’s lightweight Linux containers help solve both of these issues. By isolating tools and software involved in processing, installation is as easy as downloading a pre-built image containing the software. By containerizing the running process, software does not interfere with other installed programs. Docker containers provide the isolation and deployment advantages of virtual machines without the associated overhead. Additionally, they allow easy export of the full software environment used to run an analysis, improving our ability to reproduce results.

This post describes bcbio-nextgen-vm, a wrapper around bcbio-nextgen that runs analyses using pre-created Docker containers. The implementation is feature compatible with bcbio-nextgen but provides improved installation, isolation and reproducibility. I’ll also discuss future work to further improve provenance and traceability of analysis runs with the Arvados platform, and a fun chance to work on reproducibility and provenance at an Arvados hackathon on Tuesday March 11th.

Implementation

We reused the existing bcbio-nextgen installation scripts to create easily distributed Docker images with pipeline code and external tools. In fact, the bcbio-nextgen Dockerfile replicates current best practice recommendations for setting up the pipeline on a local system. CloudBioLinux drives installation of the software, using packaging work from existing communities such as Bio-Linux, DebianMed and homebrew-science. The advantage over the previous installation approach is that this Docker installation takes place in a defined environment and we distribute the pre-built images, avoiding the need to configure and build software on individual systems.

The pre-built Docker image contains a full manifest of installed software, from the system libraries to custom scientific packages. Coupled with the ability to export and save Docker images, this creates a reproducible run environment. Special thanks for the manifest implementation are due to the DebianMed community and Tony Travis. I had time to finish the manifest implementation while at the DebianMed Hackathon in Aberdeen. This critical component helped enable external version queries for Docker isolated software.

Tying all these parts together, the bcbio-nextgen-vm wrapper drives processing of individual run components using isolated Docker containers. The Python wrapper script uses the existing work in bcbio-nextgen for defining workflows, and it runs on distributed cluster systems using the IPython parallel framework. Using Conda and Binstar to handle installation of Python dependencies results in a streamlined installation procedure for all the wrapper software.

The diagram below shows the parts of bcbio-nextgen handled within each of the components of the system. bcbio-nextgen-vm drives the workflow and parallel runs, interacting with a cluster scheduler, and lives outside of Docker on a central server. The wrapper code manages the work of starting Docker containers and mounting external filesystems to local mounts within the Docker container. On each processing node, execution happens within isolated Docker containers with external biological software and bcbio-nextgen processing-specific code.
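
As a simplified illustration of that launch pattern, the sketch below bind-mounts the work and biological data directories into a container and runs a single command inside it. The image name, paths and command line are examples rather than the exact bcbio-nextgen-vm invocation:

import subprocess

image = "chapmanb/bcbio-nextgen-devel"      # example image name
mounts = {"/scratch/work": "/mnt/work",     # analysis working directory
          "/data/biodata": "/mnt/biodata"}  # reference genomes and indices

cmd = ["docker", "run", "--rm"]
for host_dir, container_dir in mounts.items():
    cmd += ["-v", "%s:%s" % (host_dir, container_dir)]
# Run a single bcbio command inside the container against the mounted paths.
cmd += [image, "bcbio_nextgen.py", "/mnt/work/sample-config.yaml", "-n", "8"]
subprocess.check_call(cmd)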

[Figure: Overview of Docker and bcbio-nextgen-vm components]

Availability

The initial v0.1.0 release of bcbio-nextgen-vm contains full support for all bcbio-nextgen functionality using isolated Docker containers. It runs on clusters using IPython parallel and on single machines using multiple cores, and has minimal external requirements beyond Docker. See the full installation instructions, and bcbio-nextgen-vm run instructions to get started with processing your samples. It uses the same infrastructure and input files as bcbio-nextgen, so the bcbio-nextgen documentation contains much more detail on defining the biological pipelines to run.

With the new isolated framework, you can install bcbio-nextgen on a system with only Docker installed. Conda handles installation of the Python dependencies, ideally inside of an isolated minimal Anaconda Python environment, and is the only non-Docker-contained infrastructure required. The install script will also download and prepare biological data required for processing, including genomes, index files and annotations.

We’re hoping to migrate bcbio-nextgen to this Docker enabled framework over time and welcome feedback on installation or usage challenges that still exist. Reporting problems on the GitHub issue tracker would be a major help as we continue to develop and improve the wrapper framework.

One area of particular interest is installation and security on cluster systems. While patiently waiting for the ability to run Docker as a non-root user, we recommend installing bcbio-nextgen-vm to run with the docker group id on execution. The internal scripts within the bcbio-nextgen Docker container run all commands as the calling user to mitigate security issues.

Provenance and further work

Adding Docker isolated containers provides the pipeline with improved reproducibility. Maintaining the full state of all the tools and software only requires exporting and gzipping the Docker image and storing it alongside the final processed result. The 1Gb stored image can be later reconstituted and rerun to reproduce earlier results, or shared with collaborators to ensure identical processing pipelines across multiple locations. Saving the initial input data plus the Docker image provides the ability to re-run an analysis at any point in the future.
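
For example, archiving an image alongside a finished run only takes a docker save piped through gzip. The sketch below shows the idea, with the image and file names as examples:

import gzip
import shutil
import subprocess

image = "chapmanb/bcbio-nextgen-devel"      # example image name
archive = "bcbio-docker-image.tar.gz"       # stored alongside the final results

# Stream `docker save` output through gzip into the archive file.
proc = subprocess.Popen(["docker", "save", image], stdout=subprocess.PIPE)
with gzip.open(archive, "wb") as out_handle:
    shutil.copyfileobj(proc.stdout, out_handle)
if proc.wait() != 0:
    raise RuntimeError("docker save failed for %s" % image)

# Reconstitute later with: gunzip -c bcbio-docker-image.tar.gz | docker load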

With this framework in place, the next step for improving reproducibility is enabling full provenance to trace processing steps. bcbio-nextgen currently has extensive log files of command lines and program output, but in parallel environments it requires work to deconvolute these to establish the full set of steps leading up to production of files of interest.

Arvados is a promising open source framework designed to help handle provenance and run tracking. Curoverse provides commercial support and development for the Arvados platform and recently closed a round of financing as they continue to expand and develop the framework.

If you’re interested in reproducibility and provenance, and live in the Boston area, Curoverse is hosting an Arvados hackathon next Tuesday evening, March 11th at their offices. I’ll be there learning about ways to integrate bcbio-nextgen with the work they’re doing and would be happy to talk with anyone about the Docker work or reproducible pipelines in general.

Written by Brad Chapman

March 6, 2014 at 5:37 am