Blue Collar Bioinformatics


Genomics X Prize public phase: reference genome preparation and comparisons to Illumina and Complete Genomics



The Archon Genomics X Prize, presented by Express Scripts, is a 10 million dollar competition to establish highly accurate clinical grade sequencing and variation detection methods. Our group at Harvard School of Public Health works with the EdgeBio team on developing the infrastructure for the competition: identifying variations in the grading genomes and providing software to compare these reference variation sets against a competitor’s list of variations.

The exciting aspect of the Genomics X Prize is that it enables open comparisons between sequencing technologies and variant calling methodologies. Sequencing genomes to the high degree of accuracy sufficient for clinical usage is a difficult, open problem. Here I’ll present detailed numbers comparing variants called by different sequencing technologies and variant callers.

The public phase of the Genomics X Prize starts today, August 15th. The goal of this six month period is to have an open dialog with everyone working in the sequencing and variant calling communities. We want to refine our methods to provide the most accurate and fair variant calling for the reference genomes. To start the discussion we’ve prepared:

The goal of this writeup, and the X Prize public phase, is to iterate over calling and unification methods to improve our algorithms and approaches. Rather than promoting or disparaging any particular technology or calling method, we’re instead providing full transparency and a good-faith effort to combine approaches. Our hope is that this will help engage the community, encourage feedback, and result in an unbiased and accurate set of reference genomes for the competition.

Unification of variant calls

For the August 15th public phase kickoff, we prepared a reference data set of NA19239 based on pooled sequencing of haploid fosmid clones. The callable regions of these clones totaled 129,513,026 bases, covering ~4% of the 3.1 billion bases in the human genome. We use fosmid clones to obtain complete regional haplotype coverage and focus on partial genome coverage to achieve high coverage depth and accuracy for assessed regions.

Version 0.1 of the NA19239 reference set uses variant calls from two technologies, Illumina and SOLiD, and three callers: GATK’s UnifiedGenotyper, FreeBayes and samtools. To move from these data to a unified call set we:

  • Align to GRCh37 reference genome with Novoalign.
  • Perform post-processing and indel realignment with GATK’s IndelRealigner.
  • Perform variant calling with GATK’s UnifiedGenotyper, FreeBayes and samtools mpileup.
  • Do pairwise comparisons between all technology/caller approaches.
  • Generate the union of all possible calls and merge with initial GATK calls, recalling any no-call positions at expected sites.
  • Use validation information on variants found in multiple technologies, plus metrics associated with common variants, to filter the full call set to a final set of trusted calls.

The challenging decisions begin when merging and filtering the final call set. This requires careful bookkeeping and variant representation to ensure identical variants are directly comparable, followed by setting cutoffs for variant inclusion.

Comparison details

The details of variant comparisons introduce an additional layer of complexity during assessment. The approach we’ve taken is to create a normalized set of variants so all comparison differences are due to actual call differences rather than variant representation. We split multiple nucleotide polymorphisms into individual calls, split complex indel-variant combinations, and left-align remaining variants.
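
As a rough illustration of the first normalization step, here is a minimal Python sketch that splits a multiple nucleotide polymorphism into individual single-base calls. The tuple layout and the `split_mnp` name are assumptions made for this example, not the representation used by the comparison software, and the sketch does not attempt indel left-alignment.

```python
from typing import List, Tuple

# Hypothetical variant representation: (chrom, 1-based position, ref, alt)
Variant = Tuple[str, int, str, str]


def split_mnp(variant: Variant) -> List[Variant]:
    """Split an equal-length multi-base substitution into per-base SNP calls.

    Indels (ref and alt of different lengths) and plain SNPs are returned
    unchanged. Positions where ref and alt bases agree are dropped.
    """
    chrom, pos, ref, alt = variant
    if len(ref) != len(alt) or len(ref) == 1:
        return [variant]
    return [(chrom, pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt))
            if r != a]


# ACG -> TCA differs at the first and third base, giving two SNP calls
print(split_mnp(("1", 100, "ACG", "TCA")))
```

After this step, a caller that reports one MNP and a caller that reports two adjacent SNPs produce identical records, so any remaining difference reflects the calls rather than the representation.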

For haploid/diploid comparisons, we establish haplotype blocks for the diploid sequence based on phasing provided in the input variant file, and then compare the best matching haplotype to our fosmid reference. Single nucleotide polymorphisms and indels less than 30bp require exact matches between two comparison genomes. Larger indels and structural variations receive more flexible matching with confidence intervals around start and end coordinates.
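
The two matching regimes can be sketched in Python as below. The 30bp cutoff comes from the text; the `Call` class, the `calls_match` function and the 10bp `slop` interval are hypothetical choices for this example, since the actual confidence interval sizes are not stated here.

```python
from dataclasses import dataclass

EXACT_MATCH_MAX_BP = 30  # variants smaller than this require exact matches


@dataclass(frozen=True)
class Call:
    chrom: str
    start: int
    end: int
    ref: str
    alt: str

    @property
    def size(self) -> int:
        # Length change introduced by the variant; 0 for SNPs
        return abs(len(self.alt) - len(self.ref))


def calls_match(a: Call, b: Call, slop: int = 10) -> bool:
    """Compare two calls; slop is an assumed interval (bp) around each endpoint."""
    if a.chrom != b.chrom:
        return False
    if a.size < EXACT_MATCH_MAX_BP or b.size < EXACT_MATCH_MAX_BP:
        # SNPs and small indels: every field must agree
        return a == b
    # Larger events: allow start and end to differ within the interval
    return abs(a.start - b.start) <= slop and abs(a.end - b.end) <= slop
```

A 50bp deletion called at 1000-1050 would match one called at 1004-1047 under these settings, while a SNP shifted by a single base would not match.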

The goal of the normalized, compared variants is to reflect real underlying differences in calling approaches relative to how well we can currently resolve variation endpoints.

Comparisons between variation callers

For a concrete example of two different variant calling approaches, below is a table comparing GATK variants against samtools calls for the NA19239 sample, using identically aligned and post-processed BAMs:

concordant: total 160851
concordant: SNPs 136146
concordant: indels 24705
GATK discordant: total 13925
GATK discordant: SNPs 1315
GATK discordant: indels 12610
samtools discordant: total 25368
samtools discordant: SNPs 17247
samtools discordant: indels 8121

The number of discordant variant calls is high, making up 8% of the GATK calls and 14% of the samtools calls, and samtools calls almost 16,000 additional SNPs compared to GATK. As a result, a large percentage of variants require making hard decisions: are those additional calls interesting, real variants in samtools and false negatives in the GATK calls? Or conversely, are they false positives in samtools that GATK correctly excludes?
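
For readers who want to check the percentages, the following Python sketch reproduces them from the table. It assumes each caller's total is its concordant calls plus its own discordant calls, which matches the quoted figures.

```python
def discordant_pct(concordant: int, discordant: int) -> float:
    """Percentage of a caller's total calls that are discordant."""
    return 100.0 * discordant / (concordant + discordant)


concordant = 160851
gatk_discordant = 13925
samtools_discordant = 25368

print(f"GATK discordant: {discordant_pct(concordant, gatk_discordant):.1f}%")
print(f"samtools discordant: {discordant_pct(concordant, samtools_discordant):.1f}%")
# Additional discordant SNPs called by samtools relative to GATK
print(f"extra samtools SNPs: {17247 - 1315}")
```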

Comparisons between sequencing technologies

There is a similar level of discrepancy when comparing variant calls between Illumina and SOLiD sequencing. Below is a comparison between GATK UnifiedGenotyper calls on the two technologies:

concordant: total 135263
concordant: SNPs 122267
concordant: indels 12996
Illumina discordant: total 39491
Illumina discordant: unique 7079
Illumina discordant: SNPs 15188
Illumina discordant: indels 24303
SOLiD discordant: total 16022
SOLiD discordant: unique 3800
SOLiD discordant: SNPs 3908
SOLiD discordant: indels 12114

Unique coverage explains some differences: 4% of the Illumina variants (7079) and 2.5% of the SOLiD variants (3800) were in regions covered by only that technology. However, the remaining discordant variant calls are on the order of those seen in the caller comparison. Adding to the complexity, we find only 84% of the total concordant variants seen in the Illumina-only GATK/samtools comparison.

Unified call set

The level of discrepancy between calling methods and sequencing approaches introduces complexity in the preparation of the final call set: How much evidence does a variant need for inclusion? Can single calls be true positives if supported by high confidence values? This will require extensive refinement throughout the public phase. For the initial version 0.1 release of NA19239, we took the following high level approach to filtering:

  • Retain variants found in 4 out of 6 calling/technology methods (including genotyping data).
  • Retain variants identified across multiple technologies.
  • Retain variants found in both more stringent (GATK) and more lenient (FreeBayes, samtools) callers.
  • Assess remaining variants using a Support Vector Machine with quality score, read depth and variant distance from read ends metrics, training the classifier on likely true and false positives from the pairwise overlap comparisons.
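
A minimal Python sketch of the first three retention rules follows. It reads the rules as alternatives (a variant passing any one is kept) and leaves everything else for the classifier; that reading, the `(technology, caller)` data layout and the function name are assumptions for illustration. Genotyping data, which counts as one of the six methods, is not modeled separately.

```python
from typing import Set, Tuple

STRINGENT = {"gatk"}
LENIENT = {"freebayes", "samtools"}


def passes_prefilter(support: Set[Tuple[str, str]], min_methods: int = 4) -> bool:
    """Decide whether a variant is retained before classification.

    support holds the (technology, caller) pairs that reported the variant.
    """
    techs = {tech for tech, _ in support}
    callers = {caller for _, caller in support}
    if len(support) >= min_methods:
        return True  # found by 4 or more methods
    if len(techs) > 1:
        return True  # identified across multiple technologies
    if callers & STRINGENT and callers & LENIENT:
        return True  # found by both a stringent and a lenient caller
    return False  # left for the Support Vector Machine step


print(passes_prefilter({("illumina", "gatk"), ("illumina", "samtools")}))
```

Variants that fail all three checks are not discarded; they are the ones passed on to the classifier described in the last step.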

The result is a unified call set of 171,009 variants derived from all technologies and callers, which we’re releasing as NA19239 version 0.1.

Comparisons with whole genome datasets

To assess the quality of the unified call set, we compared to two public genomes:

This provides us with three independent call sets to assess variability between approaches. To provide a baseline, here is the comparison of the Illumina and Complete Genomics calls in our assessment regions:

Overall genotype concordance 98.47
concordant: total 205868
concordant: SNPs 186365
concordant: indels 19503
Illumina discordant: total 31267
Illumina discordant: SNPs 19334
Illumina discordant: indels 11933
Complete Genomics discordant: total 15174
Complete Genomics discordant: SNPs 9586
Complete Genomics discordant: indels 5510

We see familiar discordance rates: 13% of the Illumina calls and 7% of the Complete Genomics calls differ. Since it’s diploid versus diploid, this comparison includes all heterozygous variant matches. As a result the numbers in this comparison will be higher, but it is a good order of magnitude approximation for looking at our fosmid reference set versus each individual technology.


Illumina

The comparison against the Illumina whole genome variant calls contains 12% discordant calls in our fosmid reference set, with 79% of those being indel differences. Indels are notoriously more difficult to identify and assess, so this will be an area of increased focus as we move forward:

concordant: total 150420
concordant: SNPs 132604
concordant: indels 17816
fosmid discordant: total 19624
fosmid discordant: SNPs 4165
fosmid discordant: indels 15459
Illumina discordant: total 5475
Illumina discordant: SNPs 2952
Illumina discordant: indels 2523

Complete Genomics

The Complete Genomics comparison has 17% discordant calls including 2x more discordant SNP calls. This highlights another key area of call set refinement: identifying and correcting for technology specific calls.

concordant: total 139559
concordant: SNPs 126296
concordant: indels 13263
fosmid discordant: total 29883
fosmid discordant: SNPs 10162
fosmid discordant: indels 19721
Complete Genomics discordant: total 7571
Complete Genomics discordant: SNPs 5542
Complete Genomics discordant: indels 1965


The initial NA19239 public genome for the Genomics X Prize provides unified variant calls based on two sequencing technologies and three calling methods. I’ve delved into a lot of details on our approaches, challenges and goals with the hopes of encouraging suggestions from other researchers working on these problems. We’re especially interested in feedback on these areas of ongoing research:

  • Digging deeper into potential false positives and negatives: By combining comparison information between the unified callset and external resources, we can identify 17654 fosmid variants (10%) found in neither the Complete Genomics nor the Illumina datasets. These require additional in-depth analysis to classify as uniquely identified fosmid calls or potential false positives. Similarly, Illumina and Complete Genomics combine to call 1228 variants (0.7%) that are not in the fosmid call set. These need examination to classify as fosmid false negatives, or false positive calls in the individual technologies.
  • Additional public genomes: We’re actively working with teams like the Genome in a Bottle Consortium and Genome Reference Consortium to compare with their reference sets and approaches. Our next target public genome is NA12878, used in both of these projects and widely studied.
  • Improve variant representation and assessment: The variation software framework works hard to make variant representations as uniform as possible. Indels are especially challenging and we welcome practical examples of regions that need additional standardization.
  • Refine approaches to unifying variant calls: What we learn from the additional inspection of discordant variants can help inform improved approaches to filtering. This is a great opportunity to develop generalized, reusable methods for combining variants from multiple approaches.
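
The false positive and negative candidates described in the first item above amount to set operations on normalized variants. Here is a small Python sketch with toy data; the tuple keys and function name are hypothetical, and it assumes the second group means variants called by both external datasets.

```python
def potential_errors(fosmid: set, illumina: set, complete_genomics: set):
    """Return (fosmid-only calls, calls in both external sets missing from fosmid)."""
    # Candidate fosmid false positives, or calls unique to the fosmid data
    fosmid_only = fosmid - (illumina | complete_genomics)
    # Candidate fosmid false negatives, or shared technology false positives
    missing_from_fosmid = (illumina & complete_genomics) - fosmid
    return fosmid_only, missing_from_fosmid


# Toy variants keyed as (chrom, position, ref, alt)
fosmid = {("1", 10, "A", "T"), ("1", 20, "C", "G"), ("1", 30, "G", "A")}
illumina = {("1", 10, "A", "T"), ("1", 40, "T", "C")}
complete_genomics = {("1", 20, "C", "G"), ("1", 40, "T", "C")}

only, missing = potential_errors(fosmid, illumina, complete_genomics)
print(sorted(only), sorted(missing))
```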

The call sets used here are available as public data folders on GenomeSpace:

  • Public/chapmanb/xprize/NA19239-v0_1 – The combined final call set along with training true/false positives and Illumina/Complete Genomics comparison based potential false positives and negatives.
  • Public/EdgeBio/PublicData/Release1 – All of the raw input data, including fastq files, BAM alignments and individual variant calls.

Combined with the open source code and configurations, we hope this will provide interested researchers with all the raw materials needed to reproduce and extend these analyses. Your feedback and suggestions are very welcome.

Written by Brad Chapman

August 15, 2012 at 9:10 am

Extending the GATK for custom variant comparisons using Clojure


The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.

Here I’ll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We’ll:

  • Write a GATK walker that plots variant quality scores using the Map/Reduce API.
  • Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
  • Use the VariantContext API to parse and access variant information in a VCF file.

The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.

Map/Reduce GATK Walker

GATK’s well documented Map/Reduce API eases the development of custom programs for processing BAM and VCF files. The presentation from Eli Lilly is a great introduction to developing your own custom GATK Walkers in Java. Here we’ll follow a similar approach to code these in Clojure.

We’ll start by defining a simple Java base class that extends the base GATK walker and defines an output and input variable. The output is a string specifying the output file to write to, and the input is any type of variant file the GATK accepts. Here we’ll be dealing with VCF input files:

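public abstract class BaseVariantWalker extends RodWalker {
  // @Output and @ArgumentCollection expose these fields as GATK command line arguments
  @Output
  public String out;

  @ArgumentCollection
  public StandardVariantContextInputArgumentCollection invrns = new StandardVariantContextInputArgumentCollection();
}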

This base class is all the Java we need. We implement the remaining walker in Clojure and will walk through the fully annotated source in sections. To start, we import the base walker we wrote and extend this to generate a Java class, which the GATK will pick up and make available as a command line walker:

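(ns bcbio.variation.vcfwalker
  (:import [bcbio.variation BaseVariantWalker])
  (:use [bcbio.variation.variantcontext :only [from-vc]])
  (:require [incanter.charts :as icharts]
            [incanter.core :as icore])
  (:gen-class
   :name bcbio.variation.vcfwalker.VcfSimpleStatsWalker
   :extends bcbio.variation.BaseVariantWalker))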

Since this is a Map/Reduce framework, we first need to implement the map function. GATK passes this function a tracker, used to retrieve the actual variant call values and a context which describes the current location. We use the invrns argument we defined in Java to reference the input VCF file supplied on the commandline. Finally, we extract the quality score from each VariantContext and return those. This map function produces a stream of quality scores from the input VCF file:

(defn -map
  [this tracker ref context]
  (if-not (nil? tracker)
    (for [vc (map from-vc
                    (.getValues tracker (.variants (.invrns this))
                                (.getLocation context)))]
      (-> vc :genotypes first :qual))))

For the reduce part, we take the stream of quality scores and plot a histogram. In the GATK this happens in 3 functions: reduceInit starts the reduction step and creates a list to add the quality scores to, reduce collects all of the quality scores into this list, and onTraversalDone plots a histogram of these scores using the Incanter statistical library:

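;; Start with an empty vector to collect quality scores
(defn -reduceInit
  [this]
  [])

;; Add the current quality scores, if any, to the collected vector
(defn -reduce
  [this cur coll]
  (if-not (nil? cur)
    (vec (flatten [coll cur]))
    coll))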

(defn -onTraversalDone
  [this result]
  (doto (icharts/histogram result
                           :x-label "Variant quality"
                           :nbins 50)
    (icore/save (.out this) :width 500 :height 400)))

We’ve implemented a full GATK walker in Clojure, taking advantage of existing Clojure plotting libraries. To run this, compile the code into a jarfile and run like a standard GATK tool:

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats \
  -r test/data/grch37.fa --variant test/data/gatk-calls.vcf --out test.png

which produces a plot of quality score distributions:

[Figure: GATK walker quality scores]

Custom GATK Annotation

GATK’s Variant Annotator is a useful way to add metrics information to a file of variants. These metrics allow filtering and prioritization of variants, either by variant quality score recalibration or hard filtering. We can add new annotation metrics by inheriting from GATK Java interfaces. Here we’ll implement Mean Neighboring Base Quality (NBQ), a metric from the Atlas2 variation suite that assesses the quality scores in a region surrounding a variation.

We start walking through the full implementation by again defining a generated Java class that inherits from a GATK interface. In this case, InfoFieldAnnotation:

(ns bcbio.variation.annotate.nbq
  (:import [org.broadinstitute.sting.gatk.walkers.annotator.interfaces InfoFieldAnnotation]
           [org.broadinstitute.sting.utils.codecs.vcf VCFInfoHeaderLine VCFHeaderLineType])
  (:require [incanter.stats :as istats])
  (:gen-class
   :name bcbio.variation.annotate.nbq.MeanNeighboringBaseQuality
   :extends org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation))

The annotate function does the work of calculating the mean quality score. We define functions that use the GATK API to:

  • Retrieve the pileup at the current position.
  • Get the neighbor qualities from a read at a position.
  • Combine the qualities for all reads in a pileup.

With these three functions, we can use the Clojure threading macro to cleanly organize the steps of the operation as we retrieve the pileup, get the qualities and calculate the mean:

;; flank-bp, the number of flanking bases to include, is defined elsewhere in the namespace
(defn -annotate
  [_ _ _ _ contexts _]
  (letfn [(get-pileup [context]
            (if (.hasExtendedEventPileup context)
              (.getExtendedEventPileup context)
              (.getBasePileup context)))
          (neighbor-qualities [[offset read]]
            (let [quals (-> read .getBaseQualities vec)]
              (map #(nth quals % nil) (range (- offset flank-bp) (+ offset flank-bp)))))
          (pileup-qualities [pileup]
            (map neighbor-qualities (map vector (.getOffsets pileup) (.getReads pileup))))]
    {"NBQ" (->> contexts
                vals
                (map get-pileup)
                (map pileup-qualities)
                flatten
                (remove nil?)
                istats/mean
                (format "%.2f"))}))
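
To make the calculation itself easier to follow outside of the GATK API, here is a plain Python sketch of the same idea: collect the base qualities in a window around the variant position on each read, drop positions that fall off the read, and report the mean. The pileup is modeled as a list of `(offset, qualities)` pairs and the default `flank_bp` of 5 is a hypothetical value; neither is taken from the library.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple


def neighbor_qualities(quals: Sequence[int], offset: int,
                       flank_bp: int) -> List[Optional[int]]:
    """Qualities at [offset - flank_bp, offset + flank_bp); None when off the read."""
    return [quals[i] if 0 <= i < len(quals) else None
            for i in range(offset - flank_bp, offset + flank_bp)]


def mean_neighboring_base_quality(pileup: Sequence[Tuple[int, Sequence[int]]],
                                  flank_bp: int = 5) -> str:
    """Mean quality over all flanking bases of all reads, formatted to 2 decimals."""
    quals = [q
             for offset, read_quals in pileup
             for q in neighbor_qualities(read_quals, offset, flank_bp)
             if q is not None]
    return f"{mean(quals):.2f}"


# Two toy reads: variant at offset 1 of the first read and offset 0 of the second
print(mean_neighboring_base_quality([(1, [10, 20, 30, 40]), (0, [30, 40])], flank_bp=2))
```

This sketch raises an error on an empty pileup; a real annotation would need to decide what to report in that case.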

With this in place we can now run this directly using the standard GATK command line arguments. As before, we create a jar file with the new annotator, and then pass the name as a desired annotation when running the VariantAnnotator, producing a VCF file with NBQ annotations:

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator \
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam \
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf

Access VCF variant information

In addition to extending the GATK through walkers and annotations you can also utilize the extensive API directly, taking advantage of parsers and data structures to handle common file formats. Using Clojure’s Java interoperability, the variantcontext module provides a high level API to parse and extract information from VCF files. To loop through a VCF file and print the location, reference allele and called alleles for each variant we:

  • Open a VCF source providing access to the underlying file inside a with-open statement to ensure closing of the resource.
  • Parse the VCF source, returning an iterator of VariantContext maps for each variant in the file.
  • Extract values from the map: the chromosome, start, reference allele and called alleles for the first genotype.

(use 'bcbio.variation.variantcontext)

(with-open [vcf-source (get-vcf-source "test/data/gatk-calls.vcf")]
  (doseq [vc (parse-vcf vcf-source)]
    (println (:chr vc) (:start vc) (:ref-allele vc)
             (-> vc :genotypes first :alleles))))

This produces:

MT 73 #<Allele G*> [#<Allele A> #<Allele A>]
MT 150 #<Allele T*> [#<Allele C> #<Allele C>]
MT 152 #<Allele T*> [#<Allele C> #<Allele C>]
MT 195 #<Allele C*> [#<Allele T> #<Allele T>]

I hope this tour provides some insight into the powerful tools that can be rapidly built by leveraging the GATK from Clojure. The full library contains a range of additional functionality including normalization of complex MNPs and support for phased haplotype comparisons.

Written by Brad Chapman

March 4, 2012 at 8:13 pm

Posted in analysis


Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration


My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running CloudBioLinux and CloudMan during the AWS Genomics Event (talk slides are available).

The culmination of all this feedback is two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:

  • BioCloudCentral — A web service that launches CloudBioLinux and CloudMan clusters on Amazon Web Services hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.

  • A custom Galaxy integrated front-end to next-generation sequencing pipelines. A jQuery UI wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.

Special thanks are due to Enis Afgan for his help building these tools. He provided his boto expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.

This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind-the-scenes details described earlier are still available: the pipeline uses a CloudBioLinux image containing a wide variety of bioinformatics software and you can use ssh or an NX graphical client to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.

BioCloudCentral — starting a CloudBioLinux instance

To get started, sign up for an Amazon Web Services account. This gives you access to on-demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the Amazon security credentials page.

With these, navigate to BioCloudCentral and fill out the simple entry form. In addition to your access credentials, enter your choice of a name used to identify the cluster, and your choice of password to access the CloudMan web interface and the cluster itself via ssh or NX.

Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.

Before leaving the monitoring page, you want to download a pre-formatted user-data file; this allows you to later start the same CloudMan instance directly from the Amazon web services console.

CloudMan — managing the cluster

The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:


This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.

Galaxy pipeline interface — running the analysis

This Galaxy instance is a fork of the main codebase containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Login with the demonstration account (user:; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:

The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional work node (Large instance type).

Galaxy — retrieving and displaying results

The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don’t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab-separated file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:

As a reminder, don’t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the Amazon console.

Analysis pipeline details and extending this work

The backend analysis pipeline is a freely available set of Python modules included on the CloudBioLinux AMI. The pipeline closely follows current best practice variant detection recommendations from the Broad GATK team:

The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.

We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.

Written by Brad Chapman

November 29, 2011 at 8:50 pm

Posted in analysis


Parallel approaches in next-generation sequencing analysis pipelines


My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.

Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single Hi-Seq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.

The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarrassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.

Process overview — points for parallel implementations

The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files:

[Figure: Initial lane processing]
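
The barcode splitting that begins lane processing can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: barcodes sit at the end of each read and must match exactly, and the `split_by_barcode` name and read tuples are hypothetical rather than taken from the pipeline code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def split_by_barcode(reads: Iterable[Read],
                     barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Group reads by sample, trimming the matched barcode from each read.

    barcodes maps a barcode sequence to its sample name. Reads without a
    matching barcode are collected under "unmatched".
    """
    by_sample: Dict[str, List[Read]] = defaultdict(list)
    for name, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.endswith(barcode):
                keep = len(seq) - len(barcode)
                by_sample[sample].append((name, seq[:keep], qual[:keep]))
                break
        else:
            # No barcode matched this read
            by_sample["unmatched"].append((name, seq, qual))
    return dict(by_sample)


reads = [("r1", "ACGTTTGA", "IIIIIIII"), ("r2", "GGGGCCAT", "IIIIIIII"),
         ("r3", "ACGTACGT", "IIIIIIII")]
print(split_by_barcode(reads, {"TTGA": "sample1", "CCAT": "sample2"}))
```

Each resulting group can then be aligned and processed independently, which is what makes this first level of parallelization possible.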

The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling:

[Figure: Sample processing overview]

The variant calling steps utilize The Genome Analysis Toolkit (GATK) from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls:

[Figure: GATK variant calling details]

Subsequent steps include assessment of variant effects using snpEff and haplotype phasing of variants in diploid organism analyses.

Messaging approach to parallel execution

The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface starts this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes:

[Figure: Messaging approach]
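
The controller and worker pattern can be illustrated with a small Python sketch. To stay self-contained it swaps RabbitMQ for an in-process `queue.Queue` and processing nodes for threads, so it shows only the shape of the message flow, not the pipeline's actual messaging code. The `run_analysis` name and its parameters are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_analysis(samples: Iterable[str], process: Callable[[str], str],
                 num_workers: int = 2) -> Dict[str, str]:
    """Distribute per-sample work to workers through a job queue."""
    jobs: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            sample = jobs.get()
            if sample is None:  # sentinel: no more work for this worker
                break
            results.put((sample, process(sample)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    # The controller publishes one message per sample, then one sentinel per worker
    for sample in samples:
        jobs.put(sample)
    for _ in threads:
        jobs.put(None)
    for thread in threads:
        thread.join()

    finished = {}
    while not results.empty():
        sample, value = results.get()
        finished[sample] = value
    return finished


print(run_analysis(["lane1", "lane2", "lane3"], str.upper))
```

Because the controller only talks to the queue, the same structure works whether the workers are threads on one machine or nodes on a cluster, which is the property the messaging approach relies on.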

In my previous post, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.

The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the bcbio-nextgen GitHub repository.

Written by Brad Chapman

September 10, 2011 at 3:12 pm

Distributed exome analysis pipeline with CloudBioLinux and CloudMan


A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.

Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:

  • CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.

  • CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.

Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:

  • The underlying analysis software is from CloudBioLinux.
  • CloudMan provides an SGE cluster managed via a web front end.
  • RabbitMQ is used for communication between cluster nodes.
  • An automated pipeline, written in Python, organizes parallel processing across the cluster.

Below are instructions for starting a cluster on Amazon EC2 resources to run an exome sequencing pipeline that processes FASTQ sequencing reads, producing fully annotated variant calls.

Start cluster with CloudBioLinux and CloudMan

Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:

Once this is booted, proceed to the CloudMan web interface on the server and start up an instance from this shared identifier:


This screencast shows all of the details, including starting an additional node on the SGE cluster:

Configure AMQP messaging

Edit: The AMQP messaging steps have now been fully automated so the configuration steps in this section are no longer required. Skip down to the ‘Run Analysis’ section to start processing the data immediately.

With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:

$ ssh -i your-keypair

Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:

host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester

Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:

$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...
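
To show how a client could consume these settings, here is a minimal sketch that reads the four values with Python's configparser and assembles an AMQP connection URL for the bionextgen virtual host. The section name `galaxy_amqp` and the helper `amqp_url` are assumptions made for illustration; check the section header in your own universe_wsgi.ini.

```python
import configparser
from urllib.parse import quote

# Example settings copied from the AMQP section above; the section
# header name is an assumption.
EXAMPLE_INI = """
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
"""


def amqp_url(ini_text, vhost, section="galaxy_amqp"):
    """Build an amqp:// URL from INI text and a virtual host name."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    return "amqp://{user}:{pw}@{host}:{port}/{vhost}".format(
        user=quote(cfg["userid"], safe=""),
        pw=quote(cfg["password"], safe=""),
        host=cfg["host"],
        port=cfg.getint("port"),
        vhost=quote(vhost, safe=""))


if __name__ == "__main__":
    print(amqp_url(EXAMPLE_INI, "bionextgen"))
```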

Run analysis

With messaging in place, we are ready to run the analysis. /export/data contains a ready to run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:

 $ cd /export/data/work
 $ /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/config/run_info.yaml

This command starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:

  • Alignment with BWA
  • Preparation of merged alignment files with Picard
  • Recalibration and realignment with GATK
  • Variant calling with GATK
  • Assessment of predicted variant effects with snpEff
  • Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard
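
The ordering of these steps can be sketched in a few lines of Python. This is only a schematic under simplifying assumptions: `align` is a placeholder that renames a file instead of running BWA, the stage names are labels invented for this example, and a thread pool stands in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Labels for the steps that follow alignment, in the order listed above.
SAMPLE_STAGES = ["merge", "recalibrate_realign", "call_variants",
                 "annotate_effects", "summarize"]


def align(fastq):
    """Placeholder alignment step: returns the BAM name a real run would produce."""
    return fastq.replace(".fastq", ".bam")


def run_pipeline(sample, fastq_files, max_workers=4):
    # Alignment is independent for each FASTQ file, so it can fan out.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bams = list(pool.map(align, fastq_files))
    # The remaining stages run in order on the combined alignments.
    state = {"sample": sample, "inputs": bams, "completed": ["align"]}
    for stage in SAMPLE_STAGES:
        state["completed"].append(stage)
    return state


if __name__ == "__main__":
    print(run_pipeline("Test replicate 2", ["8_1.fastq", "8_2.fastq"]))
```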

Monitor the running process

The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGE's qstat command will tell you where the analysis servers are running on the cluster:

$ qstat
job-ID  prior   name   user  state submit/start at   queue
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47

Listing files in the working directory will show our progress:

$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp

The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:

$ less
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK

Retrieve results

The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:

$ /export/data/galaxy/post_process.yaml

For each sample, this script copies:

  • A BAM file with aligned sequences and original FASTQ data.
  • A realigned and recalibrated BAM file, ready for variant calling.
  • Variant calls in VCF format.
  • A tab delimited file of predicted variant effects.
  • A PDF summary file containing alignment, variant calling and hybrid selection statistics.

into an output directory for the flowcell: /export/data/galaxy/storage/100326_FC6107FAAXX:

$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd

As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.

Learn more

All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.

The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.

If you’d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I’ll be discussing it at the AWS Genomics Event in Seattle on September 22nd.

Written by Brad Chapman

August 19, 2011 at 5:33 pm

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog

with 3 comments

Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions.

Biological question

The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of majority and minority variants. We are filtering variants on three metrics of quality:

  • Quality score — The sequencing technology’s assessment of the correctness of a base.
  • K-mer score — An estimate of the uniqueness of the region surrounding the base call position, built using khmer. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.
  • Mapping score — The aligner’s estimate of the reliability of the read alignment.

Each read and position is in a tab delimited file that looks like:

951     G       31      0.0515130211584 198

The training data has a set of known variable positions, and details about how the current variant calling algorithm did at each position:

951     T       false_positive  0.7
953     A       true_positive   4.0

We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.

Writing cascalog queries

Cascalog is based on the Datalog rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them. This requires a change of mindset from the more standard approach of writing detailed instructions about what to do.

Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. This post from Antonio Piccolboni compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.

The main query defines the outputs, retrieves input data from the snpdata and location target files described above, provides a count of reads at each position and base of interest, then averages the kmer, quality and mapping score metrics described earlier:

(defn calc-snpdata-stats [snpdata targets]
  (??<- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct
         ?avg-qual ?avg-map ?type]
        (snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score)
        (targets ?chr ?pos ?base ?type)
        (ops/count ?count)
        (ops/avg ?kmer-pct :> ?avg-kmer-pct)
        (ops/avg ?qual :> ?avg-qual)
        (ops/avg ?map-score :> ?avg-map)
        (combine-score ?kmer-pct ?qual ?map-score :> ?score)
        (ops/avg ?score :> ?avg-score)))

A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the min-max normalized scores:

(defn min-max-norm [score minv maxv]
  (let [trunc-score-max (if (< score maxv) score maxv)
        trunc-score (if (> trunc-score-max minv) trunc-score-max minv)]
    (/ (- trunc-score minv) (- maxv minv))))

(defmapop combine-score [kmer-pct qual map-score]
  (+ (min-max-norm kmer-pct 1e-5 0.10)
     (min-max-norm qual 4.0 35.0)
     (min-max-norm map-score 0.0 250.0)))
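
For readers less familiar with Clojure, the same scoring logic can be restated as a short Python sketch. It mirrors the two functions above and reuses their bounds; the only assumption is that the example read line shown earlier carries quality 31, k-mer score 0.0515130211584 and mapping score 198, in that order.

```python
def min_max_norm(score, minv, maxv):
    """Clamp score to [minv, maxv], then rescale it to the range 0.0 to 1.0."""
    clamped = max(minv, min(score, maxv))
    return (clamped - minv) / (maxv - minv)


def combine_score(kmer_pct, qual, map_score):
    """Sum the three normalized metrics, giving a value between 0.0 and 3.0."""
    return (min_max_norm(kmer_pct, 1e-5, 0.10)
            + min_max_norm(qual, 4.0, 35.0)
            + min_max_norm(map_score, 0.0, 250.0))


if __name__ == "__main__":
    # Metrics taken from the example read line.
    print(round(combine_score(0.0515130211584, 31, 198), 3))
```

Because each metric is clamped before rescaling, a single extreme value (for example a mapping score above 250) cannot contribute more than 1.0 to the combined score.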

The final part of the code involves parsing the files and producing the snpdata and targets inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:

(defmapop parse-snpdata-line [line]
  (let [[space pos base qual kmer-pct map-score] (split line #"\t")]
    [space (Integer/parseInt pos) base (Integer/parseInt qual)
     (Float/parseFloat kmer-pct) (Integer/parseInt map-score)]))

(defn snpdata-from-hfs [dir]
  (let [source (hfs-textline dir)]  ; read each line of the input files as text
    (<- [?chr ?pos ?base ?qual ?kmer-pct ?map-score]
        (source ?line)
        (parse-snpdata-line ?line :> ?chr ?pos ?base ?qual
                                     ?kmer-pct ?map-score))))
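
The parsing and aggregation described above can also be sketched in plain Python, which may help when checking results on a small file. This is an imperative approximation of the declarative query, not the Cascalog implementation. It assumes six tab-separated fields per read line, matching the destructuring in parse-snpdata-line, and the `targets` dictionary layout is invented for this example.

```python
from collections import defaultdict


def parse_snpdata_line(line):
    # Assumes six tab-separated fields; the first is treated as a
    # sequence identifier, as in the Clojure destructuring.
    space, pos, base, qual, kmer_pct, map_score = line.rstrip("\n").split("\t")
    return space, int(pos), base, int(qual), float(kmer_pct), int(map_score)


def summarize(lines, targets):
    """Count reads and average each metric per targeted (space, pos, base)."""
    groups = defaultdict(list)
    for line in lines:
        space, pos, base, qual, kmer_pct, map_score = parse_snpdata_line(line)
        if (space, pos, base) in targets:  # the join against known positions
            groups[(space, pos, base)].append((qual, kmer_pct, map_score))
    summary = {}
    for key, rows in groups.items():
        n = len(rows)
        avg_qual, avg_kmer, avg_map = (sum(col) / n for col in zip(*rows))
        summary[key] = {"count": n, "avg_qual": avg_qual,
                        "avg_kmer_pct": avg_kmer, "avg_map": avg_map,
                        "type": targets[key]}
    return summary


if __name__ == "__main__":
    reads = ["seqA\t951\tT\t30\t0.01\t100", "seqA\t951\tT\t20\t0.03\t200"]
    print(summarize(reads, {("seqA", 951, "T"): "false_positive"}))
```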

Running on Hadoop

The full project is available on GitHub. To run on a configured Hadoop system, you build the code, copy your input files to HDFS, then run:

% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/snp-assess/data
% hadoop fs -mkdir /tmp/snp-assess/positions
% hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data
% hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions
% hadoop jar snp-assess-0.0.1-SNAPSHOT-standalone.jar
             snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions

The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:

% lein deps
% lein run :snp-data /directory/of/variation/data /directory/of/positions

Both approaches generate tabular output with our positions, counts, scores and average metrics:

| 951 | T |  3 | 0.9 | 2.0e-04 | 24.7 | 55.7  | false_positive |
| 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive  |

Overview and additional projects

Cascalog provided an easy to use abstraction on top of Hadoop, which enabled exploration of densely mapped next-generation sequencing reads for variant detection. The code is free of scaling specific details, and instead focuses purely on the data of interest.

Another example of Cascalog in a biological setting is the answer I wrote to Pierre’s question on BioStar, dealing with overlapping genomic segments within Hadoop. The code is available from GitHub as an additional starting point for getting oriented with Hadoop and Cascalog.

Written by Brad Chapman

July 4, 2011 at 9:25 pm

Posted in analysis


Bioinformatics jobs at Harvard School of Public Health

with 3 comments

I’ve recently moved positions to the bioinformatics core at Harvard School of Public Health. It’s a great place to do science, with plenty of researchers doing interesting work and actively looking for bioinformatics collaborators. The team, working alongside members of the Hide Lab, is passionate about open source work. Both qualities made it a great fit for my interests and experience.

My new group is currently hiring bioinformatics researchers. The work involves interacting collaboratively with a research group to understand their biological problem, creatively attacking the mountains of data underlying the research question, and presenting the results back in an intuitive fashion. On the programming side, it’s an opportunity to combine existing published toolkits with your own custom algorithms and approaches. On the biology side, you should be passionate and interested in thinking of novel ways to advance our understanding of the problems. Practically, all of this work will involve a wide range of technologies and approaches; I expect plenty of next-generation sequencing data and lots of learning about the best ways to scale analyses.

Our other goal is to build re-usable tools for the larger research community. We work extensively with analysis frameworks like Galaxy and open standards like ISA-Tab. We hope to extract the common parts from disparate experiments to build abstractions that help get new analyses done quicker. Tool building also involves automating and deploying analysis pipelines in a way that allows biologists to run them directly. By democratizing analyses and presenting results to researchers at a high level they can directly interact with, science is accelerated and the world becomes an awesomer place.

So if you enjoy the work I write about here, and have always secretly wanted to sit in an office right next to me, now is your big chance (no stalkers, please). If this sounds of interest, please get in touch and I’d be happy to pass along more details.

Written by Brad Chapman

April 10, 2011 at 2:25 pm

Posted in analysis
