Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘clojure

An automated ensemble method for combining and evaluating genomic variants from multiple callers

with 10 comments


A key goal of the Archon Genomics X Prize infrastructure is development of a set of highly accurate reference genome variants. I’ve described our work preparing these reference genomes, and specifically defined the challenges behind merging genomic variant calls from multiple technologies and calling methods. Comparing calls from two different calling methods, for example GATK and samtools mpileup, produces a large number of differing variants which need reconciliation. Taking the overlapping subset from multiple callers is too conservative and will miss real variations, while including all calls is too liberal and introduces false positives.

Here I’ll describe a fully automated approach for preparing an accurate set of combined variant calls. Ensemble machine learning methods are a powerful way to incorporate inputs from multiple models. We use a heuristic and support vector machine (SVM) algorithm to consolidate variants, producing a final set of calls with better sensitivity and specificity than current best practice methods. The approach is open source, fully automated and generalizable to both human diploid sequencing as well as X Prize haploid reference fosmids.

We use a pair of replicates from EdgeBio’s clinical exome sequencing pipeline to prepare ensemble variant calls in the widely studied HapMap NA12878 genome. Compared to variants from a single calling method, the ensemble method produced more concordant variants when comparing the replicates, with fewer discordants. The finalized ensemble calls also provide a useful method to compare strengths and weaknesses of individual calling methods. The implementation is freely available and I’ll discuss how to get it running on your data so you can use, critique and extend the methods. This work is a collaboration between Harvard School of Public Health, EdgeBio and NIST.

Comparison materials and algorithm

A difficult aspect of evaluating variant calling methods is establishing a reference set of calls. For X Prize we use three established methods, each of which comes with tradeoffs. Metrics like transition/transversion ratios or dbSNP overlap provide a global picture of calling but are not fine grained enough to distinguish improvements over best practices. Sanger validation restricts you to a manageable subset of calls. Comparisons against public resources like 1000 genomes bias results towards technologies and callers used in preparing those variant callsets.

Here we employ a fourth method by comparing replicates from EdgeBio’s clinical exome sequencing pipeline. These are NA12878 samples independently prepared using Nimblegen’s version 3.0 kit and sequenced on an Illumina HiSeq. By comparing the replicates in regions with 4 or more reads in both samples, we identify the ability of variant calling algorithms to call identical variations with differing coverage and error profiles.

We aligned reads with novoalign and performed deduplication, base recalibration and realignment using GATK best practices. With these prepared reads, we called variants with five approaches:

  • GATK UnifiedGenotyper – Bayesian approach to call SNPs and indels, treating each position independently.
  • GATK HaplotypeCaller – Performs local de-novo assembly to call SNPs and indels on individual haplotypes.
  • FreeBayes – Bayesian calling approach that handles simultaneous SNPs and indel calling via assessment of regional haplotypes.
  • samtools mpileup – Uses an approach similar to GATK’s UnifiedGenotyper for SNP and indel calling.
  • VarScan – Calls variants using a heuristic/statistic approach eliminating common sources of bias.

We took a combined heuristic and machine learning approach to consolidate these five sets of variant calls into a final ensemble callset. The first step is to prepare the union of all variant calls from the input callers, identifying calling methods that support each variant. Secondly, we annotate each variant with metrics including strand bias, allele balance, regional sequence entropy, position of calls within reads, regional base quality and overall genotype likelihoods. We then filter this prepared set of all possible variants to produce a final set of trusted calls.

The first filtering step is to heuristically identify trusted variants based on the number of callers supporting them. This configurable parameter allow you to make an initial conservative cutoff for including variants in the final calls: I trust variants supported by N or more callers.

For the remaining calls that fall below the trusted support cutoff, we distinguish true and false positives using a support vector machine (SVM). The annotated metrics described above are the input parameters and we prepare true and false positives for the classifier using a multi-step process:

  • Use variants found in all callers as the true positive set, and variants found in a single caller as false positives. Use these training variants to identify an initial set of below-cutoff variants to include and exclude.
  • With this initial set of below-cutoff true/false variants, re-train multiple classifiers stratified based on variant characteristics: variant type (indels vs SNPs), zygosity (heterozygous vs homozygous) and regional sequence complexity.
  • Use these final classifiers to identify included and excluded variants falling below the trusted calling support cutoff.

The final set of calls includes the trusted variants and those that pass the SVM filtering. An input configuration file for variant preparation and assessment allows adjustment of the trusted threshold as well as defining which metrics to use for building the SVM classifiers.

Ensemble calling improvements

We assess calling sensitivity and specificity by comparing concordant and discordant variant calls between the replicates. To provide a consistent method to measure both SNP and indel correctness, we use the positive predictive value as the percentage of concordant calls between duplicates (concordant variants / (concordant variants + discordant variants)). This is different than the overall concordance rate, which also includes non-variant regions where both replicates do not call a variation. As a result these percentages will be lower if you expect the 99% values that result when including reference calls. The advantage of this metric is that it’s easily interpreted as the percentage of concordant called variants. It also allows individual comparisons of SNPs and indels, since the overall number of indels are low compared to the total bases considered. GATK’s VariantEval documentation has a nice discussion of alternative metrics to genotype concordance.

As a baseline we used calls from GATK’s UnifiedGenotyper to represent a current best practice approach. GATK calls 117079 SNPs, 86.6% of which are concordant. It also calls 14966 indels, with 64.6% concordant. Here are the full concordant and discordant numbers, broken down by variant type and replicate:

concordant: total 111159
concordant: SNPs 101495
concordant: indels 9664
rep1 discordant: total 9857
rep1 discordant: SNPs 7514
rep1 discordant: indels 2343
rep2 discordant: total 11029
rep2 discordant: SNPs 8070
rep2 discordant: indels 2959
het/hom discordant 4181

Our ensemble method produces improvements in both total concordant variants detected and the ratio of concordant to discordants. For SNPs, the ensemble calls add 5345 additional variants to a total of 122424, with an increase in concordance to 87.4%. For indels the major improvement is in removal of discordants: We identify 14184 indels with 67.2% concordant. Here is the equivalent table for the ensemble method:

concordant: total 116608
concordant: SNPs 107063
concordant: indels 9545
rep1 discordant: total 9555
rep1 discordant: SNPs 7581
rep1 discordant: indels 1974
rep2 discordant: total 10445
rep2 discordant: SNPs 7780
rep2 discordant: indels 2665
het/hom discordant 3975

For scientists who have worked to increase sensitivity and specificity of individual variant callers, it’s exciting to be able to improve both simultaneously. As mentioned above, you can also tune the method to increase specificity or sensitivity by adjusting the support needed for including trusted variants.

The final ensemble callsets from both replicates are available as VCF files from GenomeSpace in the xprize/NA12878-exome-v_03 folder:

Comparison of calling methods

Calling the same samples with multiple callers allows direct comparisons between calling methods. The advantage of producing an accurate final set of ensemble calls is that this provides a baseline to evaluate the strengths and weaknesses of different calling methods. The figure below compares concordant, missing variants and additional variants called by each of the 5 methods in comparison with the consolidated ensemble calls:

Concordance/discordance for calling methods

  • GATK UnifiedGenotyper provides the best SNP calling, followed closely by samtools mpileup.
  • For indel calling, the GATK HaplotypeCaller produces the most concordant calls followed by UnifiedGenotyper and FreeBayes. UnifiedGenotyper does good as well, but is conservative and has the fewest additional indels. FreeBayes and GATK HaplotypeCaller both provide resolution of individual haplotypes which helps in regions with heterozygous indels or closely spaced SNPs and indels.
  • If you want to use a single variant caller, GATK UnifiedGenotyper does the best overall job.
  • If you wanted to choose free open-source tools for calling, I would recommend samtools for SNP calling and FreeBayes for indel calling.

Variant calling methods with recommendations for both calling and filtering provide the best out of the box performance. An advantage of GATK and samtools is they provide calling, variant quality metrics, and filtering. On the other side, FreeBayes is a good example of a powerful tool that takes some time to learn to filter optimally. One potential source of bias in producing the individual calls is that I personally have more experience with GATK tools so may have room to improve with the other callers.

Availability and usage

Combining multiple calling approaches improves both sensitivity and specificity of the final set of variants. The downside is the need to run and coordinate calls from all of the different callers. To mitigate this, we developed an automated pipeline that ties together multiple open-source tools using two custom components:

  • bcbio-nextgen – A Python framework to run a full sequencing analysis pipeline from input fastq files to consolidated ensemble variant calls. It supports multiple aligners and variant callers, and enables distributed work over multiple cores on a large machine or multiple machines in a cluster environment.
  • bcbio.variation – A Clojure toolkit build on top of GATK’s variant API that provides ensemble call preparation as well as more general functionality for normalizing and comparing variants produced by multiple callers.

bcbio-nextgen has a script, built on functionality in the CloudBioLinux project, that automates installation of associated variant callers and data dependencies:

python install_directory data_directory

With the dependencies installed, you describe the input files and analysis with a YAML formatted input file. The NA12878 ensemble configuration file used for this analysis provides a useful starting point. Run the analysis, distributed on multiple cores, with: bcbio_system.yaml ensemble_sample.yaml -n 8

The bcbio-nextgen documentation provides additional details about configuration inputs and distributed processing. The framework generally handles the automation and processing involved with high throughput sequencing analysis.

EdgeBio kindly made the NA12878 datasets used in this analysis publicly available:

I welcome feedback on the approach, data or tools and am actively working to extend this and make it easier to use. As re-sequencing becomes increasingly important for human health applications it is critical that we develop open, shared best-practice workflows to handle the data processing. This allows us to focus back on the fun and difficult work of understanding the biology.

Written by Brad Chapman

February 6, 2013 at 7:25 am

Genomics X Prize public phase update: variant classification and de novo calling

with 7 comments


Last month I described our work at HSPH and EdgeBio preparing reference genomes for the Archon Genomics X Prize public phase, detailing methods used in the first version of our NA19239 variant calls. We’ve been steadily improving the calling approaches, and released version 0.2 on the X Prize validation website and GenomeSpace. Here I’ll describe the improvements we’ve made over the last month, focusing on two specific areas:

  • De novo calling: Zam Iqbal suggested using his cortex_var de novo variant caller in addition to the current GATK, FreeBayes and samtools callers. With his help, we’ve included these calls in this release, and provide comparisons between de novo and alignment based methods.
  • Improved variant classification: Consolidating variant calls from multiple callers involves making tough choices about when to include or exclude variants. I’ll describe the details of selecting metrics for use in SVM classification and filtering of variants.

Our goal is to iteratively improve our calling and variant preparation to create the best possible set of reference calls. I’d be happy to talk more with anyone working on similar problems or with insight into useful ways to improve our current callsets. We have a Get Satisfaction site for discussion and feedback, and have offered a $2500 prize for helpful comments.

As a reminder, all of the code and data used here is freely available:

  • The variant analysis infrastructure, built on top of GATK, automates genome preparation, normalization and comparison. It provides a full pipeline, driven by simple configuration files, for consolidating multiple variant calls.
  • The combined variant calls, including training data and potential true and false positives, are available from GenomeSpace: Public/chapmanb/xprize/NA19239-v0_2.
  • The individual variant calls for each technology and calling method are also available from GenomeSpace: Public/EdgeBio/PublicData/Release1.

de novo variant calling with cortex_var

de novo variant calling performs reference-free assembly of either local or global genome regions, then subsequently uses these assemblies to call variants relative to a known reference. The advantage is that assemblies can avoid errors associated with mapping to the reference, helping resolve large variations as well as small variations near problem alignments or low complexity regions.

Hybrid approaches that use localized de novo assembly in variant regions help mitigate the extensive computational requirements associated with whole-genome assembly. Complete Genomics variant calling and GATK 2.0’s Haplotype Caller both provide pipelines for hybrid de novo assembly in variant detection. The fermi and SGA assemblers are also used in variant calling, although the paths from assembly to variants are not as automated.

Thanks to Zam’s generous assistance, we used cortex_var for localized de novo assembly and variant calling within individual fosmid boundaries. As a result, CloudBioLinux now contains automated build instructions for cortex_var , handling binary builds for multiple k-mer and color combinations. An automated cortex_var pipeline, part of the bcbio-nextgen toolkit, runs the processing workflow:

  • Start with reads aligned to fosmid regions using novoalign.
  • For each fosmid region, extract all mapped reads along with a local reference genome for variant calling.
  • de novo assemble all reads in the fosmid region and call variants against the local reference genome using cortex_var’s Bubble Caller.
  • Remap regional variant coordinates back to the full genome.
  • Combine all regional calls into the final set of cortex_var calls.

Directly comparing GATK and cortex_var calls shows similar levels of concordance and discordance as the GATK/samtools comparison from the last post:

concordant: total 153787
concordant: SNPs 130913
concordant: indels 22874
GATK discordant: total 20495
GATK discordant: SNPs 6522
GATK discordant: indels 13973
cortex_var discordant: total 26790
cortex_var discordant: SNPs 21342
cortex_var discordant: indels 5448

11% of the GATK calls and 14% of the cortex_var calls are discordant. The one area where cortex_var does especially well is on indels: 19% of the cortex_var indels do not agree with GATK, in comparison with 37% of the GATK calls and 25% of the samtools calls. The current downside to this is SNP calling, where cortex_var has 3 times more discordant calls than GATK.

Selection of classification metrics

Overlapping variant calls from different calling methods (GATK, FreeBayes, samtools and cortex_var) and sequencing technologies (Illumina, SOLiD and IonTorrent) produces 170,286 potential calls in the fosmid regions. 58% (99,227) of these are present in all callers and technologies, so we need to do better than the intersection in creating a consolidated callset.

As detailed in the previous post, we filter the full set in two ways. The first is to keep trusted variants based on their presence in a defined number of technologies or callers. We currently have an inclusive set of criteria: keep variants present in either 4 out of the 7 callsets, 2 distinct technologies, or 3 distinct callers. This creates a trusted set containing 95% (162,202) of the variants. Longer term the goal is to reduce the trusted count and rely on automated filtering approaches based on input metrics.

This second automated filtering step uses a support vector machine (SVM) to evaluate the remaining variants. We train the SVM on potential true positives from variants that overlap in all callers and technologies, and potential false positives found uniquely in one single caller. The challenge is to find useful metrics associated with these training variants that will help provide discrimination.

In version 0.1 we filtered with a vanilla set of metrics: depth and variant quality score. To identify additional metrics, we used a great variant visualization tool developed by Keming Labs alongside HSPH and EdgeBio. I’ll write up more details about the tool once we have a demonstration website but the code is already available on GitHub.

To remove variants preferentially associated with poorly mapping or misaligned reads, we identified two useful metrics. ReadPosEndDist, written as a GATK annotation by Justin Zook at NIST, identifies variants primarily supported by calls at the ends of reads. Based on visual examination, these associate with difficult to map regions, as identified by Genome Mappability Scores:

Secondly, we identified problematic allele balances that differ from the expected ratios. For haploid fosmid calls, we expect 100% of reads to support variants and 0% to support reference calls (in diploid calls, you also need to handle heterozygotes with 50% expected allele balance). In practice, the distribution of reads can differ due to sequencer and alignment errors. We use a metric that measures deviation from the expected allele balance and associates closely with variant likelihoods:

Improved consolidated calls

To assess the influence of adding de novo calls and additional filtering metrics on the resulting call set, we compare against whole genome Illumina and Complete Genomics calls for NA19239. Previously we’d noticed two major issues during this comparison: a high percentage of discordant indel calls and a technology bias signaled by better concordance with Illumina than Complete.

The comparison between fosmid and Illumina data shows a substantial improvement in the indel bias. Previously 46% of the total indel calls were discordant, indicative of a potential false positive problem. With de novo calls and improved filtering, we’ve lowered this to only 10% of the total calls.

concordant: total 147684
concordant: SNPs 133861
concordant: indels 13823
fosmid discordant: total 7519
fosmid discordant: SNPs 5856
fosmid discordant: indels 1663
Illumina discordant: total 5640
Illumina discordant: SNPs 1843
Illumina discordant: indels 3797

This improvement comes with a decrease in the total number of concordant indel calls, since we moved from 17,816 calls to 13,823. However a large number of these seemed to be Illumina specific since 60% of the previous calls were discordant when compared with Complete Genomics. The new callset reduces this discrepancy and only 22% of the indels are now discordant:

concordant: total 139155
concordant: SNPs 127243
concordant: indels 11912
fosmid discordant: total 15484
fosmid discordant: SNPs 12028
fosmid discordant: indels 3456
Complete Genomics discordant: total 7311
Complete Genomics discordant: SNPs 4972
Complete Genomics discordant: indels 2273

These comparisons provide some nice confirmation that we’re moving in the right direction on filtering. As before, we extract potential false positives and false negatives to continue to refine the calls: potential false positives are those called in the fosmid dataset and in neither of the Illumina or Complete Genomics sets. Potential false negatives are calls that both Illumina and Complete agree on that the fosmid calls lack.

In the new callsets, there are 5,499 (3.5%) potential false positives and 1,422 (0.9%) potential false negatives. We’ve reduced potential false positives in the previous set from 10% with a slight increase in false negatives. These subsets are available along with the full callset on GenomeSpace. We’re also working hard on an NA12878 callset with equivalent approaches and will make that available soon for community feedback.

I hope this discussion, open source code, and dataset release is useful to everyone working on problems of improving variant calling accuracy and filtering. I welcome feedback on calling, consolidation methods, interesting metrics to explore, machine learning or any of the other topics discussed here.

Written by Brad Chapman

September 17, 2012 at 8:41 am

Extending the GATK for custom variant comparisons using Clojure

with 2 comments

The Genome Analysis Toolkit (GATK) is a full-featured library for dealing with next-generation sequencing data. The open-source Java code base, written by the Genome Sequencing and Analysis Group at the Broad Institute, exposes a Map/Reduce framework allowing developers to code custom tools taking advantage of support for: BAM Alignment files through Picard, BED and other interval file formats through Tribble, and variant data in VCF format.

Here I’ll show how to utilize the GATK API from Clojure, a functional, dynamic programming language that targets the Java Virtual Machine. We’ll:

  • Write a GATK walker that plots variant quality scores using the Map/Reduce API.
  • Create a custom annotation that adds a mean neighboring base quality metric using the GATK VariantAnnotator.
  • Use the VariantContext API to parse and access variant information in a VCF file.

The Clojure variation library is freely available and is part of a larger project to provide variant assessment capabilities for the Archon Genomics XPRIZE competition.

Map/Reduce GATK Walker

GATK’s well documented Map/Reduce API eases the development of custom programs for processing BAM and VCF files. The presentation from Eli Lilly is a great introduction to developing your own custom GATK Walkers in Java. Here we’ll follow a similar approach to code these in Clojure.

We’ll start by defining a simple Java base class that extends the base GATK walker and defines an output and input variable. The output is a string specifying the output file to write to, and the input is any type of variant file the GATK accepts. Here we’ll be dealing with VCF input files:

public abstract class BaseVariantWalker extends RodWalker {
  public String out;

  public StandardVariantContextInputArgumentCollection invrns = new StandardVariantContextInputArgumentCollection();

This base class is all the Java we need. We implement the remaining walker in Clojure and will walk through the fully annotated source in sections. To start, we import the base walker we wrote and extend this to generate a Java class, which the GATK will pick up and make available as a command line walker:

(ns bcbio.variation.vcfwalker
  (:import [bcbio.variation BaseVariantWalker])
   :name bcbio.variation.vcfwalker.VcfSimpleStatsWalker
   :extends bcbio.variation.BaseVariantWalker))

Since this is a Map/Reduce framework, we first need to implement the map function. GATK passes this function a tracker, used to retrieve the actual variant call values and a context which describes the current location. We use the invrns argument we defined in Java to reference the input VCF file supplied on the commandline. Finally, we extract the quality score from each VariantContext and return those. This map function produces a stream of quality scores from the input VCF file:

(defn -map
  [this tracker ref context]
  (if-not (nil? tracker)
    (for [vc (map from-vc
                    (.getValues tracker (.variants (.invrns this))
                                (.getLocation context)))]
      (-> vc :genotypes first :qual))))

For the reduce part, we take the stream of quality scores and plot a histogram. In the GATK this happens in 3 functions: reduceInit starts the reduction step and creates a list to add the quality scores to, reduce collects all of the quality scores into this list, and onTraversalDone plots a histogram of these scores using the Incanter statistical library:

(defn -reduceInit

(defn -reduce
  [this cur coll]
  (if-not (nil? cur)
    (vec (flatten [coll cur]))

(defn -onTraversalDone
  [this result]
  (doto (icharts/histogram result
                           :x-label "Variant quality"
                           :nbins 50)
    (icore/save (.out this) :width 500 :height 400)))

We’ve implemented a full GATK walker in Clojure, taking advantage of existing Clojure plotting libraries. To run this, compile the code into a jarfile and run like a standard GATK tool:

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VcfSimpleStats
  -r test/data/grch37.fa --variant test/data/gatk-calls.vcf --out test.png

which produces a plot of quality score distributions:

GATK walker quality scores

Custom GATK Annotation

GATK’s Variant Annotator is a useful way to add metrics information to a file of variants. These metrics allow filtering and prioritization of variants, either by variant quality score recalibration or hard filtering. We can add new annotation metrics by inheriting from GATK Java interfaces. Here we’ll implement Mean Neighboring Base Quality (NBQ), a metric from the Atlas2 variation suite that assesses the quality scores in a region surrounding a variation.

We start walking through the full implementation by again defining a generated Java class that inherits from a GATK interface. In this case, InfoFieldAnnotation:

(ns bcbio.variation.annotate.nbq
  (:import [org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation]
           [org.broadinstitute.sting.utils.codecs.vcf VCFInfoHeaderLine VCFHeaderLineType])
  (:require [incanter.stats :as istats])
   :name bcbio.variation.annotate.nbq.MeanNeighboringBaseQuality
   :extends org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation))

The annotate function does the work of calculating the mean quality score. We define functions that use the GATK API to:

  • Retrieve the pileup at the current position.
  • Get the neighbor qualities from a read at a position.
  • Combine the qualities for all reads in a pileup.

With these three functions, we can use the Clojure threading macro to cleanly organize the steps of the operation as we retrieve the pileup, get the qualities and calculate the mean:

(defn -annotate
  [_ _ _ _ contexts _]
  (letfn [(get-pileup [context]
            (if (.hasExtendedEventPileup context)
              (.getExtendedEventPileup context)
              (.getBasePileup context)))
          (neighbor-qualities [[offset read]]
            (let [quals (-> read .getBaseQualities vec)]
              (map #(nth quals % nil) (range (- offset flank-bp) (+ offset flank-bp)))))
          (pileup-qualities [pileup]
            (map neighbor-qualities (map vector (.getOffsets pileup) (.getReads pileup))))]
    {"NBQ" (->> contexts
                (map get-pileup)
                (map pileup-qualities)
                (remove nil?)
                (format "%.2f"))}))

With this in place we can now run this directly using the standard GATK command line arguments. As before, we create a jar file with the new annotator, and then pass the name as a desired annotation when running the VariantAnnotator, producing a VCF file with NBQ annotations:

$ lein uberjar
$ java -jar bcbio.variation-0.0.1-SNAPSHOT-standalone.jar -T VariantAnnotator
   -A MeanNeighboringBaseQuality -R test/data/GRCh37.fa -I test/data/aligned-reads.bam
   --variant test/data/gatk-calls.vcf -o annotated-file.vcf

Access VCF variant information

In addition to extending the GATK through walkers and annotations you can also utilize the extensive API directly, taking advantage of parsers and data structures to handle common file formats. Using Clojure’s Java interoperability, the variantcontext module provides a high level API to parse and extract information from VCF files. To loop through a VCF file and print the location, reference allele and called alleles for each variant we:

  • Open a VCF source providing access to the underlying file inside a with-open statement to ensure closing of the resource.
  • Parse the VCF source, returning an iterator of VariantContext maps for each variant in the file.
  • Extract values from the map: the chromosome, start, reference allele and called alleles for the first genotype.
(use 'bcbio.variation.variantcontext)

(with-open [vcf-source (get-vcf-source "test/data/gatk-calls.vcf")]
  (doseq [vc (parse-vcf vcf-source)]
    (println (:chr vc) (:start vc) (:ref-allele vc)
             (-> vc :genotypes first :alleles)))

This produces:

MT 73 #<Allele G*> [#<Allele A> #<Allele A>]
MT 150 #<Allele T*> [#<Allele C> #<Allele C>]
MT 152 #<Allele T*> [#<Allele C> #<Allele C>]
MT 195 #<Allele C*> [#<Allele T> #<Allele T>]

I hope this tour provides some insight into the powerful tools that can be rapidly built by leveraging the GATK from Clojure. The full library contains a range of additional functionality including normalization of complex MNPs and support for phased haplotype comparisons.

Written by Brad Chapman

March 4, 2012 at 8:13 pm

Posted in analysis

Tagged with , , ,

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog

with 3 comments

Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions.

Biological question

The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of of majority and minority variants. We are filtering variants on 3 metrics of quality:

  • Quality score — The sequencing technology’s assessment of the correctness of a base.
  • K-mer score — An estimate of the uniqueness of the region surrounding the base call position, built using khmer. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.
  • Mapping score — The aligner’s estimate of the reliability of the read alignment.

Each read and position is in a tab delimited file that looks like:

951     G       31      0.0515130211584 198

The training data has a set of known variable positions, and details about how the current variant calling algorithm did at each position:

951     T       false_positive  0.7
953     A       true_positive   4.0

We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.

Writing cascalog queries

Cascalog is based on the Datalog rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them; it requires a change of mindset from the more standard approach that you need to write detailed instructions about what to do.

Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. This post from Antonio Piccolboni compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.

The main query defines the outputs, retrieves input data from the snpdata and location target files described above, provides a count of reads at each position and base of interest, then averages the kmer, quality and mapping score metrics described earlier:

(defn calc-snpdata-stats [snpdata targets]
  (??<- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct
         ?avg-qual ?avg-map ?type]
        (snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score)
        (targets ?chr ?pos ?base ?type)
        (ops/count ?count)
        (ops/avg ?kmer-pct :> ?avg-kmer-pct)
        (ops/avg ?qual :> ?avg-qual)
        (ops/avg ?map-score :> ?avg-map)
        (combine-score ?kmer-pct ?qual ?map-score :> ?score)
        (ops/avg ?score :> ?avg-score)))

A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the min-max normalized scores:

(defn min-max-norm [score minv maxv]
  (let [trunc-score-max (if (< score maxv) score maxv)
        trunc-score (if (> trunc-score-max minv) trunc-score-max minv)]
    (/ (- trunc-score minv) (- maxv minv))))

(defmapop combine-score [kmer-pct qual map-score]
  (+ (min-max-norm kmer-pct 1e-5 0.10)
     (min-max-norm qual 4.0 35.0)
     (min-max-norm map-score 0.0 250.0)))

The final part of the code involves parsing the files and producing the snpdata and targets inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:

(defmapop parse-snpdata-line [line]
  (let [[space pos base qual kmer-pct map-score] (split line #"\t")]
    [space (Integer/parseInt pos) base (Integer/parseInt qual)
     (Float/parseFloat kmer-pct) (Integer/parseInt map-score)]))

(defn snpdata-from-hfs [dir]
    (<- [?chr ?pos ?base ?qual ?kmer-pct ?map-score]
        (source ?line)
        (parse-snpdata-line ?line :> ?chr ?pos ?base ?qual
                                     ?kmer-pct ?map-score))))

Running on Hadoop

The full project is available on GitHub. To run on a configured Hadoop system, you build the code, copy your input files to HDFS, then run:

% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/snp-assess/data
% hadoop fs -mkdir /tmp/snp-assess/positions
% hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data
% hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions
% hadoop jar snp-assess-0.0.1-SNAPSHOT-standalone.jar
             snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions

The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:

% lein deps
% lein run :snp-data /directory/of/varation/data /directory/of/positions

Both approaches generate tabular output with our positions, counts, scores and average metrics:

| 951 | T |  3 | 0.9 | 2.0e-04 | 24.7 | 55.7  | false_positive |
| 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive  |

Overview and additional projects

Cascalog provided an easy to use abstraction on top of Hadoop, which enabled exploration of densely mapped next-generation sequencing reads for variant detection. The code is free of scaling specific details, and instead focuses purely on the data of interest.

Another example of Cascalog in a biological setting is the answer I wrote to Pierre’s question on BioStar, dealing with overlapping genomic segments within Hadoop. The code is available from GitHub as an additional starting point for getting oriented with Hadoop and Cascalog.

Written by Brad Chapman

July 4, 2011 at 9:25 pm

Posted in analysis

Tagged with , , , ,