Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools

Incremental joint variant calling

Variant calling in large populations is challenging due to the difficulty in providing a consistent set of calls at all possible variable positions. A finalized set of calls from a large population should distinguish reference calls, without a variant, from no calls, positions without enough read support to make a call. Calling algorithms should also be able to make use of information from other samples in the population to improve sensitivity and precision.

There are two issues with trying to provide complete combined call sets. First, it is computationally expensive to call a large number of samples simultaneously. Second, adding any new samples to a callset requires repeating this expensive computation. This N+1 problem highlights the inflexibility around simultaneous pooled calling of populations.

The GATK team’s recent 3.x release has a solution to these issues: Incremental joint variant discovery. The approach calls samples independently but produces a genomic VCF (gVCF) output for each individual that contains probability information for both variants and reference calls at non-variant positions. The genotyping step combines these individual gVCF files, making use of the information from the independent samples to produce a final callset.

We added GATK incremental joint calling to bcbio-nextgen along with a generalized implementation that performs joint calling with other variant callers. Practically, bcbio now supports this approach with four variant callers:

GATK HaplotypeCaller (3.2-2) – Follows current GATK recommended best practices for calling, with Variant Quality Score Recalibration used on whole genome and large population callsets. This uses individual sample gVCFs as inputs to joint calling.
FreeBayes (0.9.14-15) – A haplotype-based caller from Erik Garrison in Gabor Marth’s lab.
Platypus (0.7.9.2) – A recently published haplotype-based variant caller from Andy Rimmer at the Wellcome Trust Centre for Human Genomics.
samtools (1.0) – The recently released version of samtools and bcftools with a new multiallelic calling method. John Marshall, Petr Danecek, James Bonfield and Martin Pollard at Sanger have continued samtools development from Heng Li’s code base.

The implementation includes integrated validation against the Genome in a Bottle NA12878 reference standard, allowing comparisons between joint calling, multi-sample pooled calling and single sample calling. Sensitivity and precision for joint calling is comparable to pooled calling, suggesting we should optimize design of variant processing to cater towards individual calling and subsequent finalization of calls, rather than pooling. Generalized joint calling enables combining multiple sets of calls under an identical processing framework, which will be important as we seek to integrate large publicly available populations to extract biological signal in complex multi-gene diseases.

Terminology

There is not a consistent set of terminology around combined variant calling, but to develop one, here is how I’ll use the terms:

Joint calling – Calling a group of samples together with algorithms that do not need simultaneous access to all population BAM files. GATK’s incremental joint calling uses gVCF intermediates. Our generalized implementation performs recalling using individual BAMs supplemented with a combined VCF file of variants called in all samples.
Pooled or batch calling – Traditional grouped sample calling, where algorithms make use of read data from all BAM files of a group. This scales to smaller batches of samples.
Single sample calling – Variant calling with a single sample only, not making use of information from other samples.
Squaring off or Backfilling – Creating a VCF file from a group of samples that distinguishes reference from no-call at every position called as a variant in one of the samples. With a squared off VCF, we can use the sample matrix to consider call rate at any position. Large populations called in smaller batches will not be able to distinguish reference from no-call at variants unique to each sub-pool, so will need to be re-processed to achieve this.

Implementation

bcbio-nextgen automates the calling and validation used in this comparison. We aim to make it easy to install, use and extend.

For GATK HaplotypeCaller based joint genotyping, we implement the GATK best practices recommended by the Broad. Individual sample variant calls produce a gVCF output file that contains both variants as well as probability information about reference regions. Next, variants are jointly called using GenotypeGVFs to produce the final population VCF file.

For the other supported callers – FreeBayes, Platypus and samtools – we use a generalized recalling approach, implemented in bcbio.variation.recall. bcbio-nextgen first calls each individual sample as a standard VCF. We then combine these individual sample VCFs into a global summary of all variant positions called across all samples. Finally we recall at each potential variant position, producing a globally squared off joint callset for the sample that we merge into the final joint VCF. This process parallelizes by chromosome region and by sample, allowing efficient use of resources in both clusters and large multiple core machines.

bcbio.variation.recall generalizes to any variant caller that supports recalling with an input set of variants. Knowing the context of potential variants helps inform better calling. This method requires having the individual sample BAM file available to perform recalling. Having the reads present does provide the ability to improve recalling by taking advantage of realigning reads into haplotypes given known variants, an approach we’ll explore more in future work. The implementation is also general and could support gVCF based combining as this becomes available for non-GATK callers.

Generalized joint calling

We evaluated all callers against the NA12878 Genome in a Bottle reference standard using the NA12878/NA12891/NA12892 trio from the CEPH 1463 Pedigree, with 50x whole genome coverage from Illumina’s platinum genomes. The validation provides putative true positives (concordant), false negatives (discordant missing), and false positives (discordant extra) for all callers:

Overall, there is not a large difference in sensitivity and precision for the four methods, giving us four high-quality options for performing joint variant calling on germline samples. The post-calling filters provide similar levels of false positives to enable comparisons of sensitivity. Notably, samtools new calling method is now as good as other approaches, in contrast with previous evaluations, demonstrating the value of continuing to improve open source tools and having updated benchmarks to reflect these improvements.

Improving sensitivity and precision is always an ongoing process and this evaluation identifies some areas to focus on for future work:

Platypus SNP and indel calling is slightly less sensitive than other approaches. We worked on Platypus calling parameters and post-call filtering to increase sensitivity from the defaults without introducing a large number of false positives, but welcome suggestions for more improvements.
samtools indel calling needs additional work to reduce false positive indels in joint and pooled calling. There is more detail on this below in the comparison with single sample samtools calling.

Joint versus pooled versus single approaches

We validated the same NA12878 trio with pooled and single sample calling to assess the advantages of joint calling over single sample, and whether joint calling is comparable in quality to calling simultaneously. The full evaluation for pooled calling shows that performance is similar to joint methods:

If you plot joint, pooled and single sample calling next to each other there are some interesting small differences between approaches that identify areas for further improvement. As an example, here are GATK HaplotypeCaller and samtools with the three approaches presented side by side:

GATK HaplotypeCaller sensitivity and precision are close between the three methods, with small trade offs for different methods. For SNPs, pooled calling is most sensitive at the cost of more false positives, and single calling is more precise at the cost of some sensitivity. Joint calling is intermediate between these two extremes. For indels, joint calling is the most sensitive at the cost of more false positives, with pooled calling falling between joint and single sample calling.

For samtools, precision is currently best tuned for single sample calling. Pooled calling provides better sensitivity, but at the cost of a larger number of false positives. The joint calling implementation regains a bit of this sensitivity but still suffers from increased false positives. The authors of samtools tuned variant calling nicely for single samples, but there are opportunities to increase sensitivity when incorporating multiple samples via a joint method.

Generally, we don’t expect the same advantages for pooled or joint calling in a trio as we’d see in a larger population. However, even for this small evaluation population we can see the improvements available by considering additional variant information from other samples. For Platypus we unexpectedly had better calls from joint calling compared to pooled calling, but expect these differences to harmonize over time as the tools continue to improve.

Overall, this comparison identifies areas where we can hope to improve generalized joint calling. We plan to provide specific suggestions and feedback to samtools, Platypus and other tool authors as part of a continuous validation and feedback process.

Reproducing and extending the analysis

All variant callers and calling methods validated here are available for running in bcbio-nextgen. bcbio automatically installs the generalized joint calling implementation, and it is also available as a java executable at bcbio.variation.recall. All tools are freely available, open source and community developed and we welcome your feedback and contributions.

The documentation contains full instructions for running the joint analysis. This is an extended version of previous work on validation of trio calling and uses the same input dataset with a bcbio configuration that includes single, pooled and joint calling:

mkdir -p NA12878-trio-eval/config NA12878-trio-eval/input NA12878-trio-eval/work-joint
cd NA12878-trio-eval/config
cd ../input
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-validate-getdata.sh
bash NA12878-trio-wgs-validate-getdata.sh
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/config/examples/NA12878-trio-wgs-joint.yaml
cd ../work_joint
bcbio_nextgen.py ../config/NA12878-trio-wgs-joint.yaml -n 16

Having a general joint calling implementation with good sensitivity and precision is a starting point for more research and development. To build off this work we plan to:

Provide better ensemble calling methods that scale to large multi-sample calling projects.
Work with FreeBayes, Platypus and samtools tool authors to provide support for gVCF style files to avoid the need to have BAM files present during joint calling, and to improve sensitivity and precision during recalling-based joint approaches.
Combine variant calls with local reassembly to improve sensitivity and precision. Erik Garrison’s glia provides streaming local realignment given a set of potential variants. Jared Simpson used the SGA assembler to combine FreeBayes calls with de-novo assembly. Ideally we could identify difficult regions of the genome based on alignment information and focus more computationally expensive assembly approaches there.

We plan to continue working with the open source scientific community to integrate, extend and improve these tools and are happy for any feedback and suggestions.

Written by Brad Chapman

October 7, 2014 at 8:53 am

Posted in variation

Tagged with bioinformatics, freebayes, gatk, joint-calling, ngs, platypus, samtools, validation, variant

20 Responses

Subscribe to comments with RSS.

Are the resulting vcfs available on the 1000g ftp or elsewhere? Thanks.

Albert Vilella (@AlbertVilella)

October 7, 2014 at 1:44 pm

Reply
- Albert;
  I’d be happy to upload results to S3 to make this easier to build off of. What calls in particular are you interested in?
  
  Brad Chapman
  
  October 8, 2014 at 3:45 pm
  
  Reply
By joint calling do you mean pedigree-driven logic?

Honestly I don’t really understand the popularity of joint calling against whatever biased cohort winds up in a run at a particular time.

Jeremy Leipzig

October 9, 2014 at 11:34 am

Reply
- Jeremy;
  This doesn’t have any pedigree-driven algorithms, and is rather using the other variants in the current cohort to inform calling. I’m agreed that is makes sense to think about designing the population to include family members and others with the same genetic background. This is a step in the right direction towards having a more generalized graph-based approach to the reference genome, since the variants allow you to specifically interrogate haplotypes not necessarily found in the reference. I’m hopeful the realignment approaches, coupled with improved reference genomes, will help with these situations in the short term. Thanks much for the thoughts and discussion.
  
  Brad Chapman
  
  October 11, 2014 at 5:50 am
  
  Reply
Thanks for all the hard work! I have 3 questions:
1. Does this approach work equally well on whole exome data?
2. Have you considered integrating ABRA for assembly-based realignment? You mentioned it on twitter sometime back.
3. Could you please point me to the possible regions to exclude that are considered hard to call? I can’t find the GIAB bed files.

Johnbosco Tayebwa (@CanGenom)

October 13, 2014 at 8:41 am

Reply
- Johnbosco;
  Thanks much for the questions. This does work equally well on exome data. We’ve mostly been focusing on whole genome data due to also wanted to tackle the scaling issues involved with that but that doesn’t preclude it from working on targetted examples.
  
  We’d like to evaluate ABRA, especially in the context of indel calling in tumor/normal pairings. It is just something we haven’t had time to work on yet. Contributions are definitely welcome.
  
  For excluded regions from GiaB, the small script for setting up the validation has the URL and wget command: https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/NA12878-trio-wgs-validate-getdata.sh. We also remove low complexity regions (LCRs) and this post has pointers to the BED files for that: https://bcbio.wordpress.com/2014/05/12/wgs-trio-variant-evaluation/.
  
  Brad Chapman
  
  October 14, 2014 at 8:01 am
  
  Reply
Thanks for this great post.

Could you say a bit more about how the final VCF merging step of the generalised approach? In particular, how do you deal with variant annotations such as MQ or QD which need to be aggregated across samples?

Alistair Miles

October 14, 2014 at 4:08 am

Reply
- Alistair;
  Thanks much. In general we make use of other tools when possible, so we don’t implement this ourselves. For merging we use GATK’s CombineVariants (https://github.com/chapmanb/bcbio.variation.recall/blob/2478a4a8ead855287cb1b109976f3136bd5db6f1/src/bcbio/variation/recall/merge.clj#L65) so inherit all of the benefits and limitations of their tool. For your specific examples, GATK will drop MQ and QD if they’re present in multiple samples so I wouldn’t use them for downstream filtering since they won’t always be available. If you know of tools that handle this automatically we’d be happy to incorporate them. Thanks again.
  
  Brad Chapman
  
  October 14, 2014 at 8:18 am
  
  Reply
[…] RT @dgmacarthur: ICYMI genomics types – this post by @chapmanb comparing variant calling methods is extremely useful: https://bcbio.wordpress.com/2014/10/07/joint-calling/ […]

Friday SNPpets | The OpenHelix Blog

October 17, 2014 at 9:45 am

Reply
It would be great to have all vcf.gz files somewhere to download. Ideally the highest scoring ones first, if space is an issue. I believe Illumina BaseSpace allows for vcf uploads (registration required) and sharing.

Albert

October 23, 2014 at 9:46 am

Reply
- Albert;
  I put the final callsets from all 4 callers on an S3 bucket here:
  
  http://bcbio.s3-website-us-east-1.amazonaws.com/jointeval/
  
  Let me know if you have any questions. Thanks much for following up with these.
  
  Brad Chapman
  
  October 24, 2014 at 10:44 am
  
  Reply
Hi Brad, thanks for uploading the vcfs. Can you tell me how many lanes were used for the bams in these analysis? Are these from the 4 lane NA12878 Illumina Platinum Genomes runs uploaded to 1000G?

castanyesblaves

November 3, 2014 at 10:50 am

Reply
- Albert;
  There are 50x coverage inputs from Platinum Genomes (http://www.illumina.com/platinumgenomes/) available in SRA. The specific wgets to retrieve the fastqs are here: https://github.com/chapmanb/bcbio-nextgen/blob/master/config/examples/NA12878-trio-sv-getdata.sh Thanks much
  
  Brad Chapman
  
  November 4, 2014 at 3:01 pm
  
  Reply
Hi Brad,

Very nice analysis. Can you tell me if Discordant(shared) means true-negative?

If it is indeed true negative, how come specificity here is very different from the “minimal bam preparation pipeline” post?

Thanks a lot

Geneticist from the East

February 6, 2015 at 4:02 am

Reply
- Thanks for the question and happy to hear the post is useful. Discordant (shared) variants are positions where they are called as a variant in both the reference standard and evaluation calls, but the calls differ. For SNPs, this is generally a heterozygote/homozygote difference and for indels the two calls often overlap but are not the exact bases.
  
  The difference in numbers between this and the minimal post is the scale. This is whole genome and the minimal BAM preparation analysis is on exome. We’ve worked a lot on scaling in the meantime and able to run these reasonably fast on whole genomes. Hope this helps.
  
  Brad Chapman
  
  February 10, 2015 at 1:51 pm
  
  Reply
[…] come with a 60% false discovery rate. 60%! Wow….. of course that was written in 2013, and in 2014 a comparison of a more recent version of SAMtools shows a better performance (though still an inflated false positive […]

The five habits of bad bioinformaticians | opiniomics

November 16, 2015 at 7:35 am

Reply
Hello Brad,

great post as usual. Can you tell me where I might find the exact commands used for Freebayes and GATK in this comparison? Thank you.

Eva

November 26, 2015 at 5:47 am

Reply
- Eva;
  Glad the post is useful. Are you looking for initial calling commands or the joint recalling steps? For the initial calls, both are in bcbio-nextgen (FreeBayes: https://github.com/chapmanb/bcbio-nextgen/blob/9b3e9e167cc26be203d0731c45ea13e318718567/bcbio/variation/freebayes.py#L121 and GATK: https://github.com/chapmanb/bcbio-nextgen/blob/9b3e9e167cc26be203d0731c45ea13e318718567/bcbio/variation/gatk.py#L71). For joint recalling, those are in bcbio for GATK (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/gatkjoint.py) and in bcbio.variation.recall for FreeBayes (https://github.com/chapmanb/bcbio.variation.recall/blob/48b06869b58eebfb2380ec728605ec18e9bfac0f/src/bcbio/variation/recall/square.clj#L62). Hope this helps.
  
  Brad Chapman
  
  November 28, 2015 at 9:20 pm
  
  Reply
I have encountered a few people that did not considered using GATK alternatives because they interpreted from this post that samtools, playtypus and freebayes cannot do joint variant calling (i.e. standard stuff – non incremental) without tweaking with extra code. Could you just leave a small statement here that joint calling is perfectly possible if you have access to all bams and are willing to do it all again?

Thanks

Tiago Antao

September 23, 2016 at 2:10 pm

Reply
- Tiago — sorry about the confusion, I tried to clarify the difference between joint (calling independently into gVCF and combine) versus batch (call all samples together) in the terminology. You’re exactly right that all of the alternative callers will call in batches and can do quite large batch sizes now, so a custom implementation is not required for many cases. If this is confusing, it might also be worth leaving a note on the updated post at bcb.io: http://bcb.io/2014/10/07/joint-calling/ I’d like to revisit this with recent callers in a more up to date post as well and will try to make this more clear. Thanks for the heads up.
  
  Brad Chapman
  
  October 10, 2016 at 10:49 am
  
  Reply

Blue Collar Bioinformatics