Blue Collar Bioinformatics


Genomics X Prize public phase update: variant classification and de novo calling


Background

Last month I described our work at HSPH and EdgeBio preparing reference genomes for the Archon Genomics X Prize public phase, detailing methods used in the first version of our NA19239 variant calls. We’ve been steadily improving the calling approaches, and released version 0.2 on the X Prize validation website and GenomeSpace. Here I’ll describe the improvements we’ve made over the last month, focusing on two specific areas:

  • De novo calling: Zam Iqbal suggested using his cortex_var de novo variant caller in addition to the current GATK, FreeBayes and samtools callers. With his help, we’ve included these calls in this release, and provide comparisons between de novo and alignment-based methods.
  • Improved variant classification: Consolidating variant calls from multiple callers involves making tough choices about when to include or exclude variants. I’ll describe the details of selecting metrics for use in SVM classification and filtering of variants.

Our goal is to iteratively improve our calling and variant preparation to create the best possible set of reference calls. I’d be happy to talk more with anyone working on similar problems or with insight into useful ways to improve our current callsets. We have a Get Satisfaction site for discussion and feedback, and have offered a $2500 prize for helpful comments.

As a reminder, all of the code and data used here is freely available:

  • The variant analysis infrastructure, built on top of GATK, automates genome preparation, normalization and comparison. It provides a full pipeline, driven by simple configuration files, for consolidating multiple variant calls.
  • The combined variant calls, including training data and potential true and false positives, are available from GenomeSpace: Public/chapmanb/xprize/NA19239-v0_2.
  • The individual variant calls for each technology and calling method are also available from GenomeSpace: Public/EdgeBio/PublicData/Release1.

de novo variant calling with cortex_var

de novo variant calling performs reference-free assembly of either local or global genome regions, then uses these assemblies to call variants relative to a known reference. The advantage is that assemblies can avoid errors associated with mapping to the reference, helping resolve large variations as well as small variations near problem alignments or low-complexity regions.

Hybrid approaches that use localized de novo assembly in variant regions help mitigate the extensive computational requirements associated with whole-genome assembly. Complete Genomics variant calling and GATK 2.0’s Haplotype Caller both provide pipelines for hybrid de novo assembly in variant detection. The fermi and SGA assemblers are also used in variant calling, although the paths from assembly to variants are not as automated.

Thanks to Zam’s generous assistance, we used cortex_var for localized de novo assembly and variant calling within individual fosmid boundaries. As a result, CloudBioLinux now contains automated build instructions for cortex_var, handling binary builds for multiple k-mer and color combinations. An automated cortex_var pipeline, part of the bcbio-nextgen toolkit, runs the processing workflow (a simplified sketch of the per-region steps follows the list):

  • Start with reads aligned to fosmid regions using novoalign.
  • For each fosmid region, extract all mapped reads along with a local reference genome for variant calling.
  • de novo assemble all reads in the fosmid region and call variants against the local reference genome using cortex_var’s Bubble Caller.
  • Remap regional variant coordinates back to the full genome.
  • Combine all regional calls into the final set of cortex_var calls.
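
To make these steps concrete, here is a simplified Python sketch of the per-region loop, assuming reads are available in an indexed BAM and the reference in an indexed FASTA. The cortex_var invocation itself is hidden behind a hypothetical run_cortex_bubble_caller helper, since the real pipeline drives cortex_var's own scripts; only the read extraction, local reference preparation and coordinate remapping are shown.

import pysam

def call_region(bam_file, ref_file, region, out_prefix):
    """De novo call one fosmid region, returning variants in full-genome coordinates."""
    chrom, start, end = region
    # 1. Extract reads mapped to the fosmid region into a FASTQ file.
    reads_fq = "%s-reads.fastq" % out_prefix
    with pysam.AlignmentFile(bam_file) as bam, open(reads_fq, "w") as out:
        for read in bam.fetch(chrom, start, end):
            quals = "".join(chr(q + 33) for q in read.query_qualities)
            out.write("@%s\n%s\n+\n%s\n" % (read.query_name, read.query_sequence, quals))
    # 2. Write a local reference genome covering only this region.
    local_ref = "%s-ref.fa" % out_prefix
    with pysam.FastaFile(ref_file) as fa, open(local_ref, "w") as out:
        out.write(">%s_%s_%s\n%s\n" % (chrom, start, end, fa.fetch(chrom, start, end)))
    # 3. Assemble and call variants with cortex_var's Bubble Caller
    #    (hypothetical wrapper around the actual cortex_var command line).
    local_calls = run_cortex_bubble_caller(reads_fq, local_ref, out_prefix)
    # 4. Remap local variant coordinates back to the full genome.
    return [(chrom, start + pos, ref, alt) for pos, ref, alt in local_calls]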

Directly comparing GATK and cortex_var calls shows similar levels of concordance and discordance as the GATK/samtools comparison from the last post:

concordant: total 153787
concordant: SNPs 130913
concordant: indels 22874
GATK discordant: total 20495
GATK discordant: SNPs 6522
GATK discordant: indels 13973
cortex_var discordant: total 26790
cortex_var discordant: SNPs 21342
cortex_var discordant: indels 5448

11% of the GATK calls and 14% of the cortex_var calls are discordant. The one area where cortex_var does especially well is on indels: 19% of the cortex_var indels do not agree with GATK, in comparison with 37% of the GATK calls and 25% of the samtools calls. The current downside to this is SNP calling, where cortex_var has 3 times more discordant calls than GATK.

Selection of classification metrics

Overlapping variant calls from different calling methods (GATK, FreeBayes, samtools and cortex_var) and sequencing technologies (Illumina, SOLiD and IonTorrent) produces 170,286 potential calls in the fosmid regions. 58% (99,227) of these are present in all callers and technologies, so we need to do better than the intersection in creating a consolidated callset.

As detailed in the previous post, we filter the full set in two ways. The first is to keep trusted variants based on their presence in a defined number of technologies or callers. We currently have an inclusive set of criteria: keep variants present in either 4 out of the 7 callsets, 2 distinct technologies, or 3 distinct callers. This creates a trusted set containing 95% (162,202) of the variants. Longer term the goal is to reduce the trusted count and rely on automated filtering approaches based on input metrics.
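
As a concrete illustration, the trusted-variant criteria can be expressed as a small predicate over the callsets supporting each variant. This is a minimal sketch, assuming each variant carries a set of (technology, caller) pairs; the names are illustrative rather than the exact identifiers used in the pipeline.

def is_trusted(callsets):
    """callsets: set of (technology, caller) pairs supporting a variant."""
    technologies = {tech for tech, _ in callsets}
    callers = {caller for _, caller in callsets}
    return (len(callsets) >= 4           # present in 4 of the 7 callsets
            or len(technologies) >= 2    # 2 distinct technologies
            or len(callers) >= 3)        # 3 distinct callers

# A variant seen by GATK on Illumina and samtools on SOLiD passes via the
# two-technology rule.
print(is_trusted({("illumina", "gatk"), ("solid", "samtools")}))  # True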

The second, automated filtering step uses a support vector machine (SVM) to evaluate the remaining variants. We train the SVM on potential true positives, taken from variants that overlap in all callers and technologies, and potential false positives, found uniquely in a single caller. The challenge is to find useful metrics associated with these training variants that help discriminate between the two classes.
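
For readers who want to experiment with the same idea, here is a minimal sketch of the training step using scikit-learn, assuming the per-variant metrics have already been extracted into numeric arrays. The pipeline has its own implementation, so the preprocessing and kernel choice shown here are illustrative rather than a description of the production code.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_variant_classifier(true_pos_metrics, false_pos_metrics):
    """Train on metrics from all-callset overlaps (label 1) and single-callset calls (label 0)."""
    X = np.vstack([true_pos_metrics, false_pos_metrics])
    y = np.concatenate([np.ones(len(true_pos_metrics)), np.zeros(len(false_pos_metrics))])
    scaler = StandardScaler().fit(X)  # metrics like depth and quality live on very different scales
    clf = SVC(kernel="linear").fit(scaler.transform(X), y)
    return scaler, clf

def keep_variant(scaler, clf, metrics):
    """Return True for a variant the classifier predicts as real."""
    return clf.predict(scaler.transform([metrics]))[0] == 1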

In version 0.1 we filtered with a vanilla set of metrics: depth and variant quality score. To identify additional metrics, we used a great variant visualization tool developed by Keming Labs alongside HSPH and EdgeBio. I’ll write up more details about the tool once we have a demonstration website, but the code is already available on GitHub.

To remove variants preferentially associated with poorly mapping or misaligned reads, we identified two useful metrics. ReadPosEndDist, written as a GATK annotation by Justin Zook at NIST, identifies variants primarily supported by calls at the ends of reads. Based on visual examination, these associate with difficult-to-map regions, as identified by Genome Mappability Scores.

Secondly, we identified problematic allele balances that differ from the expected ratios. For haploid fosmid calls, we expect 100% of reads to support variants and 0% to support reference calls (in diploid calls, you also need to handle heterozygotes with 50% expected allele balance). In practice, the distribution of reads can differ due to sequencer and alignment errors. We use a metric that measures deviation from the expected allele balance and associates closely with variant likelihoods.
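
As an example of what this looks like in practice, here is a small sketch of the deviation metric for the haploid fosmid calls, assuming reference and alternate allele depths are available (for instance from a VCF AD field); the exact annotation used in the pipeline may differ.

def allele_balance_deviation(ref_depth, alt_depth):
    """Deviation from the expected 100% alternate-read support for a haploid call."""
    total = ref_depth + alt_depth
    if total == 0:
        return None
    return abs(1.0 - alt_depth / float(total))

# A clean haploid call with no reference reads deviates by 0.0, while a call
# with 5 reference and 15 alternate reads deviates by 0.25.
print(allele_balance_deviation(0, 20))   # 0.0
print(allele_balance_deviation(5, 15))   # 0.25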

Improved consolidated calls

To assess the influence of adding de novo calls and additional filtering metrics on the resulting call set, we compare against whole genome Illumina and Complete Genomics calls for NA19239. Previously we’d noticed two major issues during this comparison: a high percentage of discordant indel calls and a technology bias signaled by better concordance with Illumina than Complete.

The comparison between fosmid and Illumina data shows a substantial improvement in the indel bias. Previously 46% of the fosmid indel calls were discordant, indicative of a potential false positive problem. With de novo calls and improved filtering, we’ve lowered this to only 10% of the fosmid indel calls.

concordant: total 147684
concordant: SNPs 133861
concordant: indels 13823
fosmid discordant: total 7519
fosmid discordant: SNPs 5856
fosmid discordant: indels 1663
Illumina discordant: total 5640
Illumina discordant: SNPs 1843
Illumina discordant: indels 3797

This improvement comes with a decrease in the total number of concordant indel calls, from 17,816 to 13,823. However, a large number of these appeared to be Illumina-specific, since 60% of the previous calls were discordant when compared with Complete Genomics. The new callset reduces this discrepancy, and only 22% of the indels are now discordant:

concordant: total 139155
concordant: SNPs 127243
concordant: indels 11912
fosmid discordant: total 15484
fosmid discordant: SNPs 12028
fosmid discordant: indels 3456
Complete Genomics discordant: total 7311
Complete Genomics discordant: SNPs 4972
Complete Genomics discordant: indels 2273

These comparisons provide some nice confirmation that we’re moving in the right direction on filtering. As before, we extract potential false positives and false negatives to continue to refine the calls: potential false positives are those called in the fosmid dataset but in neither the Illumina nor the Complete Genomics sets. Potential false negatives are calls that both Illumina and Complete Genomics agree on but the fosmid calls lack.
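
Conceptually, this extraction is a pair of set operations over normalized variant keys. The sketch below assumes each callset has been reduced to a set of (chromosome, position, ref, alt) tuples; the real comparison runs through the GATK-based pipeline on normalized VCFs.

def partition_disagreements(fosmid, illumina, complete):
    """Split disagreements into potential false positives and false negatives."""
    potential_fps = fosmid - (illumina | complete)    # fosmid-only calls
    potential_fns = (illumina & complete) - fosmid    # calls both WGS sets agree on but fosmids lack
    return potential_fps, potential_fns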

In the new callsets, there are 5,499 (3.5%) potential false positives and 1,422 (0.9%) potential false negatives. We’ve reduced potential false positives from 10% in the previous set, at the cost of a slight increase in false negatives. These subsets are available along with the full callset on GenomeSpace. We’re also working hard on an NA12878 callset with equivalent approaches and will make that available soon for community feedback.

I hope this discussion, open source code, and dataset release are useful to everyone working on the problems of improving variant calling accuracy and filtering. I welcome feedback on calling, consolidation methods, interesting metrics to explore, machine learning, or any of the other topics discussed here.

Written by Brad Chapman

September 17, 2012 at 8:41 am

7 Responses


  1. Thanks for these really informative posts. I’m wondering if you could include or discuss the Ti / Tv ratios of the concordant and discordant calls for the various assemblers and callers? Often this is a decent measure of how enriched the various variant groups are for false positive calls. Keep up the good work….

    Brendan O'Fallon

    October 10, 2012 at 9:30 pm

    • Brendan;
      Thanks for the positive feedback and the suggestion to report transitions and transversions. As you suspected, the ratios are off for the identified potential false positives and negatives. The negative training set designed to identify these also reflects this:

      | name                           |     tstv | tstv-coding | tstv-noncoding |  count |
      |--------------------------------+----------+-------------+----------------+--------|
      | NA19239-v0_2-xprize            | 2.077482 |    2.380062 |       2.070972 | 156665 |
      | NA19239-v0_2-prep-postrain     | 2.215740 |    2.472993 |       2.209839 |  99227 |
      | NA19239-v0_2-prep-negtrain     | 1.027473 |    1.333333 |       1.024931 |    930 |
      | NA19239-v0_2-cmp-potential-fps | 1.474466 |    1.692308 |       1.469301 |   5499 |
      | NA19239-v0_2-cmp-potential-fns | 0.639576 |    1.000000 |       0.634409 |   1422 |
      

      One likely cause of this is that many disputed calls are in repetitive regions, reflecting different underlying mutation mechanisms. The work I’ve been doing since this post is to classify repetitive regions separately from more standard regions during filtering. I’ll definitely write up more details as I lock in on approaches that work well.
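
      For reference, ratios like the ones in the table can be recomputed from any set of SNP calls with a few lines; this is a minimal sketch over (ref, alt) base pairs, not the pipeline's own summary code.

      TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

      def ts_tv(snps):
          """snps: iterable of (ref, alt) base pairs."""
          ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
          tv = sum(1 for ref, alt in snps if ref != alt and (ref, alt) not in TRANSITIONS)
          return ts / float(tv) if tv else float("inf")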

      Thanks again,
      Brad

      Brad Chapman

      October 11, 2012 at 2:34 pm

  2. Great discussion! I am using a very similar method to call variations: I use GATK, FreeBayes, samtools (mpileup and pileup) and cortex_var. And my observations agree with your results. Some people do not like combining many callers, but I do not think any single caller can be fully trusted.

    I have a question.
    1.) What makes Cortex_var call more false SNPs? I found that a recent Cortex_var version calls more false positives than an older version. As far as I can see, Cortex_var is likely to call false heterozygous SNPs at bases where the reference has errors, since any assembler is likely to make similar errors in repetitive regions.

    Hideo

    Pico

    October 29, 2012 at 6:43 am

    • Hideo;
      Great to hear you are successfully using similar approaches. Do you have code, methodology or results available? It would be great to compare outputs for the same dataset.

      In terms of cortex_var false positives, we’ve found that both cortex_var and FreeBayes aggressively call variants in low coverage regions. They’ll happily call a variant with single read support, relying on post filtering to remove these. We use GATK’s CallableLoci walker and only assess in callable regions:

      http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_CallableLoci.html
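
      As a rough illustration of that restriction, the sketch below keeps only variants falling in regions the CallableLoci BED output marks as CALLABLE; the status name and simple interval scan are assumptions, and the real pipeline handles this during the comparison step.

      from collections import defaultdict

      def read_callable_regions(bed_file):
          """Collect intervals whose status column is CALLABLE."""
          regions = defaultdict(list)
          with open(bed_file) as handle:
              for line in handle:
                  chrom, start, end, status = line.split()[:4]
                  if status == "CALLABLE":
                      regions[chrom].append((int(start), int(end)))
          return regions

      def in_callable(regions, chrom, pos):
          """True if a variant position falls inside a callable interval."""
          return any(start <= pos < end for start, end in regions.get(chrom, []))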

      If you’re finding more persistent issues it would be worth generating some example data and discussing on the cortex_var mailing list. Zam helped us resolve some processing issues and is very responsive.

      Brad

      Brad Chapman

      October 31, 2012 at 7:47 am

    • Hi there
      I thought I would just pipe up here as Hideo came to ask me about the recent increase in false positives.

      First: It turned out the comparison was not like-for-like. In his old dataset he used the Bubble Caller, and in the new he used both Bubble Caller and Path Divergence.
      Second: he’s working on a species with a low/limited quality reference genome, which impacts the Path Divergence caller (which compares the graph with the reference).

      So the bottom line is – nothing much has changed in the Cortex calls, which is a relief as I didn’t change anything!
      Zam

      Zamin Iqbal

      November 1, 2012 at 6:31 am

  3. Hi Zam and Brad,
    Since I posted my comment, I have aggressively masked the genome where we cannot call SNPs reliably, and then everything became much cleaner.

    Cortex was finding new odd SNPs when I used a newer version, but all of the odd SNPs were screened out based on depth, k-mer and manual masking, etc. So it is essential to mask the reference for any SNP caller. De novo callers can identify mis-assemblies, but it is hard to make use of such de novo findings when the reference is incomplete. To be specific, after screening SNPs, Cortex had the fewest false positives among the 5 callers, I believe. And the new and old Cortex agree on the reliable SNPs.

    I do check the strand bias of Cortex SNPs based on pileup data, since strand-biased Cortex SNPs stand out from the rest of the high-quality Cortex SNPs; these biased SNPs were all rejected by the other methods for strand bias.

    I have not written down the methods yet, but my cutoffs for each caller are lenient. General SNP cutoffs are: pileup >=20, mpileup >=20, GATK >=50, FreeBayes >=50, Cortex >3; mapping score >40; normalized depth cutoff 0.5 <= DEP < 1.75.
    Masking files include regions unique by a 30 bp k-mer sliding window, 100 bp next to gaps, and manually excluded regions based on depth and SNP concentrations among ~300 closely related strains. I use soft masking for low-complexity regions identified by tantan (http://www.cbrc.jp/tantan/). Soft masking means that if a given site is accepted by more than 3.5 callers on average among the population, then the SNP is accepted.

    And after calling SNPs on ~300 strains, I check the number of callers, median depth, depth success rate, etc. for each site among the population. Then I accept SNP sites whose number of accepting callers (NC) is greater than 2.5. NC 2.5 seems arbitrary, but the average NC for each strain is about 4.5, so there is not much ambiguity in this number.
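
    A rough sketch of this acceptance rule (illustrative only; the caller names, exact comparisons and site representation are simplified from the description above, not the actual scripts):

    # lenient per-caller score cutoffs as described above (Cortex is described as >3)
    CUTOFFS = {"pileup": 20, "mpileup": 20, "gatk": 50, "freebayes": 50, "cortex": 3}

    def accepting_callers(scores):
        """scores: dict of caller name -> quality score at a site."""
        return [c for c, s in scores.items() if s >= CUTOFFS.get(c, 0)]

    def accept_site(site_scores, min_callers=2.5):
        """Accept a site when its number of accepting callers (NC) exceeds the threshold."""
        return len(accepting_callers(site_scores)) > min_callers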

    How am I checking my results? Our samples are from very clonal parasites, and many of them came from the reference strain which our colleague grows, so we expect zero homozygous SNPs in many of the samples. Also, we mixed BAMs of 3 strains at mixture ratios of 10, 20, … 80 and 90, so we can tell how callers identify SNPs whose allele frequencies change along with the mixing.

    We also made networks using splitstree to see which methods can reproduce the expected network; since we mixed the BAMs, we know the more-or-less true network structure. In addition to these artificially mixed samples, we have many samples related to the 3 strains. We had 2 drug induction experiments for the 3 strains, and we have drug induction intermediate samples, so we can monitor some known SNPs; these are sometimes very subtle, but we can monitor them since we know they are there.

    Results:
    In any case, a lot of colleagues do not believe my 5 caller method (5CM) works. Some accused me of producing complete junk trees. (There were serious sample mix-ups at some level, so I was accused of computationally causing it!) In short (it is not short any more), I compared 4 approaches: the 5 caller population method, the 5 caller individual method (using scores based on only one strain), the GATK individual method and the GATK population method. The 5CM and GATK population methods did produce more accurate networks than their individual counterparts. So did the 5CM do better than the GATK population method?

    POP 5CM and POP GATK called 11389 concordant SNPs in 109 strains (these are a limited subset of the ~300). GATK called 718 discordant SNPs and the 5CM 396 discordant SNPs. Considering true positives, true negatives, false positives and false negatives, among the discordant SNPs the 5CM had 89% true calls while GATK had 11%.

    To check the accuracy, I manually inspect SNPs in Artemis and samtools tview; since most of the 109 strains were derived from the 3 strains experimentally or artificially, I do not think my numbers are off by much.

    Advanced GATK options:
    GATK has more advanced options, wonderful for the human genome but very difficult to use for anything else. As far as I understand GATK, all the advanced options will reduce false positives, but not increase the power to detect subtle SNPs (sensitivity). So I assume raw GATK is the most sensitive, even though it calls a full load of false positives. (I may be wrong.) To see if I am wrong, I used base quality recalibration tools to fix and clean the base qualities of the BAMs, and used GATK to call SNPs from the cleaned BAMs.

    Then much fewer (false) SNPs were found in the clean BAMs, and in general SNP scores decreased because the base qualities were depressed. So I think I am right to say base quality recalibration would not increase sensitivity, and I do not think the POP 5CM is missing any SNPs that GATK can find. Also, the GATK cutoff of 50 I am using is generally too low for independent use. So I claim that the POP 5CM, which is similar to Brad's method, is a better choice for non-model organisms with a genome size of about 40M.
    I tested some other cutoffs for GATK and FreeBayes, but they just did not make any big difference.

    GATK uses machine learning, but junk in, junk out. If we do not have a constantly maintained, good SNP database like the human one, and if the organism's genome is more stratified and variable, then it is very difficult to use GATK well.

    So if someone gets our paper in the future, please do not trash it. We tried many things, but found that no single caller can do better than some kind of consensus method.

    I am happy to hear from GATK people who want to trash my method. But I am using GATK extensively …

    Hideo
    PS. If anyone would like to see the actual data, please let me know. I am happy to show what I have.

    Pico

    December 2, 2012 at 7:29 am

    • Hideo;
      Thanks for sharing your approaches and results. I definitely agree with your consensus methodology and think there’s a lot of value in comparing multiple datasets. One point that I especially second is the value of assessing repeat and non-repeat regions with different sets of parameters. I’ve been building models independently for these regions as well.

      Directly comparing your parasite population calls against the human work we’ve been doing might be a bit difficult, but I’d be interested in seeing the code you’ve been using to build your consensus sets. You described a bit about accepting calls based on the number of callsets containing them. Do you have any provisions for including calls below your threshold? For instance, if a call is only in two callers but very confidently called by both, would it always be excluded? Do you have a sense of how many calls of this type you exclude and how adjusting your threshold changes this?

      Glad you’re having success with the approach and looking forward to hearing more as you build on it,
      Brad

      Brad Chapman

      December 2, 2012 at 9:19 pm

