Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Whole genome trio variant calling evaluation: low complexity regions, GATK VQSR and high depth filters


Whole genome trio validation

I’ve written previously about the approaches we use to validate the bcbio-nextgen variant calling framework, specifically evaluating aligners and variant calling methods and assessing the impact of BAM post-alignment preparation methods. We’re continually looking to improve both the pipeline and validation methods and two recent papers helped advance best-practices for evaluating and filtering variant calls:

  • Michael Linderman and colleagues describe approaches for validating clinical exome and whole genome sequencing results. One key result I took from the paper was the difference in assessment between exome and whole genome callsets. Discordant exome variants are characterized by coverage differences due to capture, while discordant whole genome variants are driven by complex genomic regions. Reading this paper pushed us to evaluate whole genome population based variant calling, which is now feasible due to improvements in bcbio-nextgen scalability.
  • Heng Li identified variant calling artifacts and proposed filtering approaches to remove them, as well as characterizing caller error rates. We’ll investigate two of the filters he proposed: removing variants in low complexity regions, and filtering high depth variants.

We use the NA12878/NA12891/NA12892 trio from the CEPH 1463 Pedigree as an input dataset, consisting of 50x whole genome reads from Illumina’s platinum genomes. This enables both whole genome comparisons and pooled family calling that replicates best practice for calling within populations. We aligned reads using bwa-mem and performed streaming de-duplication detection with samblaster. Based on our previous assessment, we skipped recalibration and realignment, which enabled fully streamed preparation of BAM files from the input fastq reads. We called variants using two realigning callers, FreeBayes (v0.9.14-7) and GATK HaplotypeCaller (3.1-1-g07a4bf8), and evaluated calls against the Genome in a Bottle reference callset for NA12878 (v0.2-NIST-RTG). The bcbio-nextgen documentation has full instructions for reproducing the analysis.

This work provides three practical improvements for variant calling and validation:

  • Low complexity regions contribute 45% of the indels in whole genome evaluations, and are highly variable between callers. This replicates Heng’s results and Michael’s assessment of common errors in whole genome samples, and indicates we need to specifically identify and assess the 2% of the genome labeled as low complexity. Practically, we’ll exclude them from further evaluations to avoid non-representative bias, and suggest removing or flagging them when producing whole genome variant calls.
  • We add a filter for removing false positives from FreeBayes calls in high depth, low quality regions. This removes variants in high depth regions that are likely due to copy number or other larger structural events, and replicates Heng’s filtering results.
  • We improved settings for GATK variant quality recalibration (VQSR). The default VQSR settings are conservative for SNPs and need adjustment to be compatible with the sensitivity available through FreeBayes or GATK using hard filters.

Low complexity regions

Low complexity regions (LCRs) consist of locally repetitive sections of the genome. Heng’s paper identified these using mdust and provides a BED file of LCRs covering 2% of the genome. Repeats in these regions can lead to artifacts in sequencing and variant calling. Heng’s paper provides examples of areas where a full de-novo assembly correctly resolves the underlying structure, while local reassembly variant callers do not.
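
As a quick way to check that coverage figure, the bases in a BED file can be totaled directly. Below is a minimal Python sketch assuming a non-overlapping three-column BED; the file name and genome size are illustrative, and note (as comes up in the comments below) that the distributed file also bundles N-only regions, so its raw total is larger than the low complexity fraction alone:

    def bed_covered_fraction(bed_path, genome_size=3_101_804_739):
        """Total bases covered by a non-overlapping BED file, as a genome fraction."""
        covered = 0
        with open(bed_path) as handle:
            for line in handle:
                if not line.strip() or line.startswith(("#", "track", "browser")):
                    continue
                _chrom, start, end = line.split()[:3]
                covered += int(end) - int(start)
        return covered / genome_size

    # Illustrative usage; the file name is an assumption, not a pinned path:
    # print(bed_covered_fraction("LCR.bed"))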

To assess the impact of low complexity regions on variant calling, we compared calls from FreeBayes and GATK HaplotypeCaller to the Genome in a Bottle reference callset with and without low complexity regions included. The graph below shows concordant non-reference variant calls alongside discordant calls in three categories: missing discordants are potential false negatives, extras are potential false positives, and shared are variants that overlap between the evaluation and reference callsets but differ due to zygosity (heterozygote versus homozygote) or indel representation.
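
To make those categories concrete, here is a toy Python sketch of how calls split into concordant, extra and missing when keyed on exact position and alleles. The real comparison also reconciles the shared category (zygosity and indel representation differences), which exact keys cannot capture; everything here is illustrative rather than the evaluation code used in the pipeline:

    def classify_calls(eval_calls, ref_calls):
        """Split (chrom, pos, ref, alt) variant keys into evaluation categories."""
        concordant = eval_calls & ref_calls
        extra = eval_calls - ref_calls    # potential false positives
        missing = ref_calls - eval_calls  # potential false negatives
        return concordant, extra, missing

    evaluation = {("1", 1000, "A", "G"), ("1", 2000, "C", "T")}
    reference = {("1", 1000, "A", "G"), ("1", 3000, "G", "A")}
    print(classify_calls(evaluation, reference))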

Low complexity regions for GATK and FreeBayes validation

  • For SNPs, removing low complexity regions removes approximately 2% of the total calls for both FreeBayes and GATK, corresponding to the 2% of the genome subtracted by removing LCRs.
  • For indels, removing LCRs removes 45% of the calls due to the over-representation of indels in repeat regions. Additionally, this results in approximately equal GATK and FreeBayes concordant indels after LCR removal. Since the Genome in a Bottle reference callset uses GATK HaplotypeCaller to resolve discrepant calls, this change in concordance is likely due to bias towards GATK’s approaches for indel resolution in complex regions.
  • The default GATK VQSR calls for SNPs are not as sensitive as the FreeBayes calls. I’ll describe additional work to improve this below.

Practically, we’ll now exclude low complexity regions in variant comparisons to avoid potential bias and more accurately represent calls in the remaining non-LCR genome. We’ll additionally flag low complexity indels in non-evaluation callsets as likely to require additional followup. Longer term, we need to incorporate callers specifically designed for repeats like lobSTR to more accurately characterize these regions.
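
For callsets produced outside of bcbio, flagging variants that fall in LCR intervals needs only a little interval bookkeeping. The sketch below assumes a sorted, non-overlapping BED of LCRs; the function names and BED path are illustrative, and standard tools like bedtools intersect accomplish the same thing:

    import bisect
    from collections import defaultdict

    def load_intervals(bed_path):
        """Per-chromosome starts/ends from a sorted, non-overlapping BED file."""
        starts, ends = defaultdict(list), defaultdict(list)
        with open(bed_path) as handle:
            for line in handle:
                chrom, start, end = line.split()[:3]
                starts[chrom].append(int(start))
                ends[chrom].append(int(end))
        return starts, ends

    def in_lcr(chrom, pos, starts, ends):
        """True if a 0-based position falls inside a low complexity interval."""
        i = bisect.bisect_right(starts[chrom], pos) - 1
        return i >= 0 and pos < ends[chrom][i]

    # Usage sketch: flag, rather than drop, indels that land in LCRs
    # starts, ends = load_intervals("LCR.bed")
    # flagged = [v for v in indel_calls if in_lcr(v.chrom, v.pos, starts, ends)]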

High depth, low quality filter for FreeBayes

The second filter proposed in Heng’s paper was removal of high depth variants. This was a useful change in mindset for me as I’ve primarily thought about removing low quality, low depth variants. However, high depth regions can indicate potential copy number variations or hidden duplicates which result in spurious calls.

Comparing true and false positive FreeBayes calls with a pooled multi-sample call quality of less than 500 identifies a large grouping of false positive heterozygous variants at a combined depth, across the trio, of 200:

Heterozygotes by depth and quality: true versus false positive

The cutoff proposed by Heng was to calculate the average depth of called variants and set the cutoff at the average depth plus 3 or 4 times the square root of the average depth. This dataset has an average depth of 169 across the trio, corresponding to a cutoff of 208 with the 3x multiplier, which compares nicely with the manual cutoff you’d set by looking at the graphs above. Filtering variants with QUAL < 500 and DP > 208 produces a reduction in false positives with little impact on sensitivity:

Improvement in filtering false positives with high depth filter
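
To make the arithmetic concrete, here is a minimal Python sketch of the depth cutoff calculation described above; the 169x average depth is the trio value reported in the text, and the multiplier of 3 is the lower of Heng’s suggested options:

    import math

    def high_depth_cutoff(avg_depth, multiplier=3):
        """Average depth plus a multiple of its square root, per Heng's proposal."""
        return avg_depth + multiplier * math.sqrt(avg_depth)

    # For this trio: 169 + 3 * sqrt(169) = 208
    print(high_depth_cutoff(169))  # 208.0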

A nice bonus of this filter is that it makes intuitive sense: variants with high depth and low quality indicate there is something problematic, and depth manages to partially compensate for the underlying issue. Inspired by GATK’s QualByDepth annotation and default filter of QD < 2.0, we incorporated a generalized version of this into bcbio-nextgen’s FreeBayes filter: QUAL < (depth-cutoff * 2.0) and DP > depth-cutoff.
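
As an illustration of how that rule behaves, the predicate below flags a variant for removal when it exceeds a depth cutoff but its quality does not keep pace, mirroring the QD < 2.0 intuition. This is a sketch, not bcbio’s internal implementation; the function name and the 208 default are illustrative:

    def fails_high_depth_filter(qual, depth, depth_cutoff=208.0):
        """Flag likely false positives in high depth, low quality regions."""
        return depth > depth_cutoff and qual < depth_cutoff * 2.0

    # A 250x heterozygote with QUAL 300 would be filtered,
    # while the same depth with QUAL 600 would be kept.
    assert fails_high_depth_filter(qual=300, depth=250)
    assert not fails_high_depth_filter(qual=600, depth=250)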

GATK variant quality score recalibration (VQSR)

The other area we needed to improve was our use of GATK Variant Quality Score Recalibration. The default parameters provide a set of calls that are overly conservative relative to the FreeBayes calls. VQSR provides the ability to tune the filtering, so we experimented with multiple configurations to achieve approximately equal sensitivity to FreeBayes for both SNPs and indels. The comparisons use the Genome in a Bottle reference callset for evaluation, and include VQSR default settings, multiple tranche levels and GATK’s suggested hard filters:

VQSR tuning: SNPs and indels

While the sensitivity/specificity tradeoff depends on the research question, in trying to set a generally useful default we’d like to be less conservative than the GATK VQSR default. We learned these tips and tricks for tuning VQSR filtering:

  • The default setting for VQSR is not a tranche level (like 99.0), but rather a LOD score of 0. In this experiment, that corresponded to a tranche of ~99.0 for SNPs and ~98.0 for indels. The best-practice example documentation uses command line parameters that specify a consistent tranche of 99.0 for both SNPs and indels, so depending on which you follow as a default you’ll get different sensitivities.
  • To increase sensitivity, increase the tranche level. My expectation was that decreasing the tranche level would include more variants, but it actually applies additional filters. A useful way to think about tranche levels is that they specify the percentage of true variants you want to capture: a tranche of 99.9% captures 99.9% of the true cases in the training set, while 99.0% captures fewer.
  • We found tranche settings of 99.97% for SNPs and 98.0% for indels correspond to roughly the sensitivity/specificity that you achieve with FreeBayes. These are the new default settings in bcbio-nextgen.
  • Using hard filtering of variants based on GATK recommendations performs well and is also a good default choice. For SNPs, the hard filter defaults are less conservative and more in line with FreeBayes results than VQSR defaults. VQSR has improved specificity at the same sensitivity and has the advantage of being configurable, but will require an extra tuning step.

Overall VQSR provides good filtering and the ability to tune sensitivity but requires validation work to select tranche cutoffs that are as sensitive as hard filter defaults, since default values tend to be overly conservative for SNP calling. In the absence of the ability or desire to tune VQSR tranche levels, the GATK hard filters provide a nice default without much of a loss in precision.
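
For reference, the GATK-recommended hard filters for SNPs mentioned above can be written as a simple predicate over the standard INFO annotations. This is a sketch using the thresholds commonly documented for GATK 3.x best practices; treat the exact values as assumptions to check against the GATK documentation for your version:

    # Commonly documented GATK 3.x hard filter thresholds for SNPs (assumed here).
    SNP_HARD_FILTERS = {
        "QD": lambda v: v < 2.0,               # quality by depth
        "FS": lambda v: v > 60.0,              # Fisher strand bias
        "MQ": lambda v: v < 40.0,              # RMS mapping quality
        "MQRankSum": lambda v: v < -12.5,      # mapping quality rank sum
        "ReadPosRankSum": lambda v: v < -8.0,  # read position rank sum
    }

    def failed_snp_hard_filters(info):
        """Return the names of the hard filters a SNP fails, given its INFO values."""
        return [name for name, fails in SNP_HARD_FILTERS.items()
                if name in info and fails(info[name])]

    # Example: a call failing strand bias and mapping quality checks
    print(failed_snp_hard_filters({"QD": 5.1, "FS": 75.0, "MQ": 35.0}))  # ['FS', 'MQ']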

Data availability and future work

Thanks to continued community work on improving variant calling evaluations, this post demonstrates practical improvements in bcbio-nextgen variant calling. We welcome interested contributors to re-run and expand on the analysis, with full instructions in the bcbio-nextgen example pipeline documentation. Some of the output files from the analysis may also be useful:

  • VCF files for FreeBayes true positive and false positive heterozygote calls, used here to improve filtering via assessment of high depth regions. Heterozygotes make up the majority of false positive calls so take the most work to correctly filter and detect.
  • Shared false positives from FreeBayes and GATK HaplotypeCaller. These are potential missing variants in the Genome in a Bottle reference. Alternatively, they may represent persistent errors found in multiple callers.

We plan to continue to explore variant calling improvements in bcbio-nextgen. Our next steps are to use the trio population framework to compare pooled population calling with the incremental joint discovery approach introduced in GATK 3. We’d also like to compare with single sample calling followed by subsequent squaring off/backfilling to assess the value of concurrent population calling. We welcome suggestions and thoughts on this work and future directions.

Written by Brad Chapman

May 12, 2014 at 6:03 am

9 Responses


  1. Hi, thanks for bcbio-nextgen. In the example pipeline, we can analyze a whole genome trio, but there is no *.yaml configuration file for an exome trio. So when I analyze trio exome data, I add the parameter “batch:*” in the *.yaml file. Is that right, or should I modify other parameters? The following is my configuration file:

    details:
      - files: [../input/sample10.R1.fastq.gz, ../input/sample10.R2.fastq.gz]
        description: sample10
        metadata:
          batch: ceph
          sex: male
        analysis: variant2
        genome_build: GRCh37
        algorithm:
          aligner: bwa
          align_split_size: 5000000
          mark_duplicates: true
          recalibrate: false
          realign: false
          variantcaller: [freebayes, gatk-haplotype]
          quality_format: Standard
          coverage_interval: regional
          remove_lcr: true
          validate: ../input/GiaB_NIST_RTG_v0_2.vcf.gz
          validate_regions: ../input/GiaB_NIST_RTG_v0_2_regions.bed
          variant_regions: ../input/NGv3.bed
      - files: [../input/sample08.R1.fastq.gz, ../input/sample08.R2.fastq.gz]
        description: sample08
        metadata:
          batch: ceph
          sex: male
        analysis: variant2
        genome_build: GRCh37
        algorithm:
          aligner: bwa
          align_split_size: 5000000
          mark_duplicates: true
          recalibrate: false
          realign: false
          variantcaller: [freebayes, gatk-haplotype]
          quality_format: Standard
          coverage_interval: regional
      - files: [../input/sample09.R1.fastq.gz, ../input/sample09.R2.fastq.gz]
        description: sample09
        metadata:
          batch: ceph
          sex: female
        analysis: variant2
        genome_build: GRCh37
        algorithm:
          aligner: bwa
          align_split_size: 5000000
          mark_duplicates: true
          recalibrate: false
          realign: false
          variantcaller: [freebayes, gatk-haplotype, gatk]
          quality_format: Standard
          coverage_interval: regional

    shangqianxie

    October 20, 2014 at 11:44 pm

    • Thanks for trying out bcbio. The configuration generally looks great although you should add `variant_regions: ../input/NGv3.bed` to each of the samples so they all get called only in exome regions. Let us know if you have any problems. You’re welcome to ask questions here but the Biovalidation mailing list (https://groups.google.com/forum/#!forum/biovalidation) or GitHub issues page (https://github.com/chapmanb/bcbio-nextgen/issues) might make formatting and discussion easier. Thanks again.

      Brad Chapman

      October 21, 2014 at 2:55 pm

      • Hi Brad,
        Thanks so much for your reply. I have another question: there are two pipelines, wgs and wgs_joint. Which one should I take for my 30X whole genome data, and what is the difference between them? Thanks again!
        Best,
        Shangqian

        shangqianxie

        October 22, 2014 at 9:35 pm

  2. “Heng’s paper identified these using mdust and provides a BED file of LCRs covering 2% of the genome.”

    I downloaded that BED file of LCRs and calculated the percent of the genome it covers: it covers almost 10%. Can you tell me why it is listed as 2% in the sentence?

    Thanks,
    Mary

    Mary

    March 9, 2015 at 5:28 pm

    • Mary;
      Sorry for any confusion. The LCR file also includes non-resolved (N-only) regions in the genome, primarily near centromeres and telomeres. This makes up ~8% of the total BED file size, and the remainder are the smaller low complexity regions specific to this dataset. Hope that helps.

      Brad Chapman

      March 10, 2015 at 4:56 am

  3. Can you share what the read length was in the data you analyzed here, as this should be a critical parameter for how reliable SNVs in shorter LC regions are? I personally find it strange to categorically exclude LC regions, as there are regulatory elements and coding sequences fully co-located with LCRs, hence any mutations there would be impossible to pick up.

    Alexander T

    October 10, 2017 at 5:10 pm

    • Alexander;
      Thanks for the helpful feedback. These are 100bp paired end reads (https://www.ebi.ac.uk/ena/data/view/ERR194147). The purpose of excluding them is not that they’re unimportant, but that the truth sets we currently have do a poor job of resolving these regions in a general way. Since a small percentage (2%) of the genome contributes to a large portion (45%) of the calls, this masks how callers do in the remainder of the genome. The right way forward is to have specialized truth sets that carefully resolve medically important mutations in LCRs in an unbiased way so we can determine how well callers do in these regions and continue to improve them. As always, lots of work to do to continue to improve truth sets and callers themselves. Thanks again for the discussion.

      Brad Chapman

      October 11, 2017 at 9:38 am

