Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

An automated ensemble method for combining and evaluating genomic variants from multiple callers

with 10 comments


A key goal of the Archon Genomics X Prize infrastructure is development of a set of highly accurate reference genome variants. I’ve described our work preparing these reference genomes, and specifically defined the challenges behind merging genomic variant calls from multiple technologies and calling methods. Comparing calls from two different calling methods, for example GATK and samtools mpileup, produces a large number of differing variants which need reconciliation. Taking the overlapping subset from multiple callers is too conservative and will miss real variations, while including all calls is too liberal and introduces false positives.

Here I’ll describe a fully automated approach for preparing an accurate set of combined variant calls. Ensemble machine learning methods are a powerful way to incorporate inputs from multiple models. We use a heuristic and support vector machine (SVM) algorithm to consolidate variants, producing a final set of calls with better sensitivity and specificity than current best practice methods. The approach is open source, fully automated and generalizable to both human diploid sequencing as well as X Prize haploid reference fosmids.

We use a pair of replicates from EdgeBio’s clinical exome sequencing pipeline to prepare ensemble variant calls in the widely studied HapMap NA12878 genome. Compared to variants from a single calling method, the ensemble method produced more concordant variants when comparing the replicates, with fewer discordants. The finalized ensemble calls also provide a useful method to compare strengths and weaknesses of individual calling methods. The implementation is freely available and I’ll discuss how to get it running on your data so you can use, critique and extend the methods. This work is a collaboration between Harvard School of Public Health, EdgeBio and NIST.

Comparison materials and algorithm

A difficult aspect of evaluating variant calling methods is establishing a reference set of calls. For X Prize we use three established methods, each of which comes with tradeoffs. Metrics like transition/transversion ratios or dbSNP overlap provide a global picture of calling but are not fine grained enough to distinguish improvements over best practices. Sanger validation restricts you to a manageable subset of calls. Comparisons against public resources like 1000 genomes bias results towards technologies and callers used in preparing those variant callsets.

Here we employ a fourth method by comparing replicates from EdgeBio’s clinical exome sequencing pipeline. These are NA12878 samples independently prepared using Nimblegen’s version 3.0 kit and sequenced on an Illumina HiSeq. By comparing the replicates in regions with 4 or more reads in both samples, we identify the ability of variant calling algorithms to call identical variations with differing coverage and error profiles.

We aligned reads with novoalign and performed deduplication, base recalibration and realignment using GATK best practices. With these prepared reads, we called variants with five approaches:

  • GATK UnifiedGenotyper – Bayesian approach to call SNPs and indels, treating each position independently.
  • GATK HaplotypeCaller – Performs local de-novo assembly to call SNPs and indels on individual haplotypes.
  • FreeBayes – Bayesian calling approach that handles simultaneous SNPs and indel calling via assessment of regional haplotypes.
  • samtools mpileup – Uses an approach similar to GATK’s UnifiedGenotyper for SNP and indel calling.
  • VarScan – Calls variants using a heuristic/statistic approach eliminating common sources of bias.

We took a combined heuristic and machine learning approach to consolidate these five sets of variant calls into a final ensemble callset. The first step is to prepare the union of all variant calls from the input callers, identifying calling methods that support each variant. Secondly, we annotate each variant with metrics including strand bias, allele balance, regional sequence entropy, position of calls within reads, regional base quality and overall genotype likelihoods. We then filter this prepared set of all possible variants to produce a final set of trusted calls.

The first filtering step is to heuristically identify trusted variants based on the number of callers supporting them. This configurable parameter allow you to make an initial conservative cutoff for including variants in the final calls: I trust variants supported by N or more callers.

For the remaining calls that fall below the trusted support cutoff, we distinguish true and false positives using a support vector machine (SVM). The annotated metrics described above are the input parameters and we prepare true and false positives for the classifier using a multi-step process:

  • Use variants found in all callers as the true positive set, and variants found in a single caller as false positives. Use these training variants to identify an initial set of below-cutoff variants to include and exclude.
  • With this initial set of below-cutoff true/false variants, re-train multiple classifiers stratified based on variant characteristics: variant type (indels vs SNPs), zygosity (heterozygous vs homozygous) and regional sequence complexity.
  • Use these final classifiers to identify included and excluded variants falling below the trusted calling support cutoff.

The final set of calls includes the trusted variants and those that pass the SVM filtering. An input configuration file for variant preparation and assessment allows adjustment of the trusted threshold as well as defining which metrics to use for building the SVM classifiers.

Ensemble calling improvements

We assess calling sensitivity and specificity by comparing concordant and discordant variant calls between the replicates. To provide a consistent method to measure both SNP and indel correctness, we use the positive predictive value as the percentage of concordant calls between duplicates (concordant variants / (concordant variants + discordant variants)). This is different than the overall concordance rate, which also includes non-variant regions where both replicates do not call a variation. As a result these percentages will be lower if you expect the 99% values that result when including reference calls. The advantage of this metric is that it’s easily interpreted as the percentage of concordant called variants. It also allows individual comparisons of SNPs and indels, since the overall number of indels are low compared to the total bases considered. GATK’s VariantEval documentation has a nice discussion of alternative metrics to genotype concordance.

As a baseline we used calls from GATK’s UnifiedGenotyper to represent a current best practice approach. GATK calls 117079 SNPs, 86.6% of which are concordant. It also calls 14966 indels, with 64.6% concordant. Here are the full concordant and discordant numbers, broken down by variant type and replicate:

concordant: total 111159
concordant: SNPs 101495
concordant: indels 9664
rep1 discordant: total 9857
rep1 discordant: SNPs 7514
rep1 discordant: indels 2343
rep2 discordant: total 11029
rep2 discordant: SNPs 8070
rep2 discordant: indels 2959
het/hom discordant 4181

Our ensemble method produces improvements in both total concordant variants detected and the ratio of concordant to discordants. For SNPs, the ensemble calls add 5345 additional variants to a total of 122424, with an increase in concordance to 87.4%. For indels the major improvement is in removal of discordants: We identify 14184 indels with 67.2% concordant. Here is the equivalent table for the ensemble method:

concordant: total 116608
concordant: SNPs 107063
concordant: indels 9545
rep1 discordant: total 9555
rep1 discordant: SNPs 7581
rep1 discordant: indels 1974
rep2 discordant: total 10445
rep2 discordant: SNPs 7780
rep2 discordant: indels 2665
het/hom discordant 3975

For scientists who have worked to increase sensitivity and specificity of individual variant callers, it’s exciting to be able to improve both simultaneously. As mentioned above, you can also tune the method to increase specificity or sensitivity by adjusting the support needed for including trusted variants.

The final ensemble callsets from both replicates are available as VCF files from GenomeSpace in the xprize/NA12878-exome-v_03 folder:

Comparison of calling methods

Calling the same samples with multiple callers allows direct comparisons between calling methods. The advantage of producing an accurate final set of ensemble calls is that this provides a baseline to evaluate the strengths and weaknesses of different calling methods. The figure below compares concordant, missing variants and additional variants called by each of the 5 methods in comparison with the consolidated ensemble calls:

Concordance/discordance for calling methods

  • GATK UnifiedGenotyper provides the best SNP calling, followed closely by samtools mpileup.
  • For indel calling, the GATK HaplotypeCaller produces the most concordant calls followed by UnifiedGenotyper and FreeBayes. UnifiedGenotyper does good as well, but is conservative and has the fewest additional indels. FreeBayes and GATK HaplotypeCaller both provide resolution of individual haplotypes which helps in regions with heterozygous indels or closely spaced SNPs and indels.
  • If you want to use a single variant caller, GATK UnifiedGenotyper does the best overall job.
  • If you wanted to choose free open-source tools for calling, I would recommend samtools for SNP calling and FreeBayes for indel calling.

Variant calling methods with recommendations for both calling and filtering provide the best out of the box performance. An advantage of GATK and samtools is they provide calling, variant quality metrics, and filtering. On the other side, FreeBayes is a good example of a powerful tool that takes some time to learn to filter optimally. One potential source of bias in producing the individual calls is that I personally have more experience with GATK tools so may have room to improve with the other callers.

Availability and usage

Combining multiple calling approaches improves both sensitivity and specificity of the final set of variants. The downside is the need to run and coordinate calls from all of the different callers. To mitigate this, we developed an automated pipeline that ties together multiple open-source tools using two custom components:

  • bcbio-nextgen – A Python framework to run a full sequencing analysis pipeline from input fastq files to consolidated ensemble variant calls. It supports multiple aligners and variant callers, and enables distributed work over multiple cores on a large machine or multiple machines in a cluster environment.
  • bcbio.variation – A Clojure toolkit build on top of GATK’s variant API that provides ensemble call preparation as well as more general functionality for normalizing and comparing variants produced by multiple callers.

bcbio-nextgen has a script, built on functionality in the CloudBioLinux project, that automates installation of associated variant callers and data dependencies:

python install_directory data_directory

With the dependencies installed, you describe the input files and analysis with a YAML formatted input file. The NA12878 ensemble configuration file used for this analysis provides a useful starting point. Run the analysis, distributed on multiple cores, with: bcbio_system.yaml ensemble_sample.yaml -n 8

The bcbio-nextgen documentation provides additional details about configuration inputs and distributed processing. The framework generally handles the automation and processing involved with high throughput sequencing analysis.

EdgeBio kindly made the NA12878 datasets used in this analysis publicly available:

I welcome feedback on the approach, data or tools and am actively working to extend this and make it easier to use. As re-sequencing becomes increasingly important for human health applications it is critical that we develop open, shared best-practice workflows to handle the data processing. This allows us to focus back on the fun and difficult work of understanding the biology.

Written by Brad Chapman

February 6, 2013 at 7:25 am

10 Responses

Subscribe to comments with RSS.

  1. Thank you for the excellent post. It was especially alarming to notice how badly VarScan performs. This is because I have used it quite extensively for multi-sample (tumor + normal) variant and indel calling. However, I assume that there might be some differences between single and multi sample variant calling pipelines. Basically, having two samples to compare, makes the at least a little easier. It would be very interesting to read your thoughts about multi sample calling, since this is the task I’m spending most of my time currently.

    Tommi Vatanen

    February 8, 2013 at 7:14 am

    • Tommi;
      Thanks for the feedback and for bringing up the point about multiple sample and tumor/normal calling. Many callers optimize for these cases so I’d hesitate to generalize the results without doing the corresponding experiment. This is something I’m planning to investigate more since I have projects with multiple sample family calling on-going. More specifically on VarScan, that is the tool I have the least experience with so would welcome ideas to improve my calling and filtering approaches for these single sample cases.

      Brad Chapman

      February 8, 2013 at 7:32 pm

  2. […] See on – Bio-informaticsBlue Collar Bioinformatics …. February 6, 2013 at 7:25 am. Posted in xprize. Tagged with bioinformatics, clinical, clojure, ngs, python, variant, xprize. « Genomics X Prize public phase update: variant classification and de novo calling …See on […]

  3. I really do appreciate these very interesting and detailed comparisons of the callers. It’d be awesome if you want to retry this with the upcoming GATK 2.4 callers — we think we’ve fixed a number of issues with the HaplotypeCaller in particular that should improve quality quite a bit.

    Mark DePristo

    February 8, 2013 at 5:20 pm

    • Mark;
      Thanks much, I’m looking forward to the 2.4 release and will definitely be testing it out. One of the reasons I put so much effort into automation is that algorithms are continuously improving and I hope to keep building on this. Thanks for all the work you do on GATK.

      Brad Chapman

      February 8, 2013 at 7:46 pm

  4. Very interesting article. I’m helping out someone in SNP/SV discovery using tomato genomes (self-taught) and have tried different mappers with samtools mpileup and contrasted them, and there is quite a difference in the SNPs/SV between using different mappers let alone different callers. Most likely due to the different sensitivity etc of the different mappers, I would be curious to see if different mappings could be combined as well but I expect it then becomes to big and unmanageable. I’ve moved on to GATK now I feel more confident and using a filtered SNP/INDEL mpileup file to do the GATK. I will try and test your pipeline out sometime, it looks like a potentially comprehensive approach, I would be tempted to include everything (all SNP/sv after filtering but group them by confidence level by the number of callers rather than lose information.


    July 27, 2013 at 4:31 pm

    • Rob;
      Thanks for the great feedback. It sounds like you are on the same track that led me to explore this. My suggestion to avoid getting mired in the complexity is to try and establish some sort of benchmark of calls you trust and use this to evaluate the aligners and callers. You should fine a set you trust more than others and then can reduce the problem to dealing with a few best-practice approaches. This would also be more ammenable to a set-union approach to incorporating calls from multiple callers since you’d have more confidence in individual calls. Thanks again.

      Brad Chapman

      July 31, 2013 at 11:55 am

  5. This is indeed a very interesting and very useful article, thank you Brad. I was wondering if you have any plans to do comparisons based on pooled samples….currently I am working on pooled bacterial sequences and the outputs I get using samtools and freebayes are very discordant.


    August 13, 2014 at 2:27 am

  6. Eva;
    We’re working on cancer validation right now which has aspects of this with celluarlity and identifiaction of sub-clones and it is a more difficult problem. For something fully pooled like bacterial calling I think you’ll need to expriment with different calling options. FreeBayes does support options for pooled calling and the mailing list has some useful discussions about setting these parameters:!searchin/freebayes/pooled/freebayes/XDgrN527rps/XU7JIZKZdwEJ

    As far a I know samtools is mostly tuned for diploid calling so I’m not sure how well it will characterize the pooled sample. This might be the cause of the discrepencies you’re seeing.

    Sorry to not have more direct experience but hope this helps some.

    Brad Chapman

    August 13, 2014 at 1:12 pm

    • Thank you Brad. Yes I have been experimenting with Freebayes. I will be looking forward to all your future works.


      August 14, 2014 at 1:52 am

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: