Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Validated whole genome structural variation detection using multiple callers

with 17 comments

Structural variant detection goals

This post describes community based work integrating structural variant calling and validation into bcbio-nextgen. I’ve previously written about approaches for validating single nucleotide changes (SNPs) and small insertions/deletions (Indels), but it has always been unfortunate to not have reliable ways to detect larger structural variations: deletions, duplications, inversions, translocations and other disruptive events. Detecting these events with short read sequencing is difficult, and our goal in bcbio is to create a global summary of predicted structural variants from multiple callers alongside measures of sensitivity and precision.

The latest release of bcbio automates structural variant calling with three callers:

bcbio integrates structural variation predictions from all approaches into a high level BED file. This is a first pass way to identify potentially disruptive large scale events. Here are example regions: a duplication called by all 3 callers, a deletion called by 2 callers, and a complex region with both deletions and duplications.

9  139526855 139527537 DUP_delly,DUP_lumpy,cnv3_cn_mops
10  99034861  99037400 DEL_delly,cnv0_cn_mops,cnv1_cn_mops
12   8575814   8596742 BND_lumpy,DEL_delly,DEL_lumpy,DUP_delly,DUP_lumpy,cnv1_cn_mops,cnv3_cn_mops

This output associates larger structural events with regions of interest in a high level way, while allowing us to quickly determine the individual tool support for each event. Using this, we are no longer blind to potential structural changes and can use the summary to target in-depth investigation with the more detailed metrics available from each caller and a read viewer like PyBamView. The results can also help inform prioritization of SNP and Indel calls since structural rearrangements often associate with false positives. Longer term we hope this structural variant summary and comparison work will be useful for community validation efforts like the Global Alliance for Genomics and Health benchmarking group, the Genome in a Bottle consortium and the ICGC-TCGA DREAM Mutation Calling challenge.

Below I’ll describe a full evaluation of the sensitivity and precision of this combined approach using an NA12878 trio, as well as describe how to run and extend this work using bcbio-nextgen.

This blog post is the culmination of a lot of work and support from the open source bioinformatics community. David Jenkins worked with our our group for the summer and headed up evaluation of structural variation results. We received wonderful support from Colby Chang, Ryan Layer, Ira Hall and Aaron Quinlan on both LUMPY and structural variation validation in general. They freely shared scripts and datasets for evaluation, which gave us the materials to make these comparisons. Günter Klambauer gave us great practical advice on using cn.mops. Tobias Rausch helped immensely with tips for speeding up DELLY on whole genomes, and Ashok Ragavendran from Mike Talkowski’s lab generously discussed tricks for scaling DELLY runs. Harvard Research Computing provided critical support and compute for this analysis as part of a collaboration with Intel.

Evaluation

To validate the output of this combined structural variant calling approach we used a set of over 4000 validated deletions made available by Ryan Layer as part of the LUMPY manuscript. These are deletion calls in NA12878 with secondary support evidence from Moleculo and PacBio datasets. We further subset these regions by removing calls in low complexity regions identified in Heng Li’s variant calling artifacts paper. An alternative set of problematic regions are the potential misassembly regions identified by Colby Chang and Ira Hall during LUMPY and speedseq development (Edit: The original post mistakenly mentioned these overlap significantly with low complexity regions, but that is only due to overlap in obvious problem areas like the N gap regions. We’ll need additional work to incorporate both regions into future evaluations. Thanks to Colby for catching this error.). These are a useful proxy for regions we’re currently not able to reliably call structural variants in.

We ran all structural variant callers using the NA12878/NA12891/NA12892 trio from the CEPH 1463 Pedigree as an input dataset. This consists of 50x whole genome reads from Illumina’s platinum genomes project, and is the same dataset used in our previous analyses of population based SNP and small indel calling.

Our goal is to define boundaries on sensitivity – the percentage of validated calls we detect – and precision – how many of the total calls overlap with validated regions. We required a simple overlap of the called regions with validated regions to consider a called variant as validated, and stratified results by event size to quantify detection metrics at different size ranges.

The comparison highlights the value of providing a combined call set. I’d caution against using this as a comparison between methods. Accurate structural variation calling depends on multiple sources of evidence and we still have work to do in improving our ability to filter for maximal sensitivity and specificity. The ensemble method in the figure below displays results of our final calls, made from collapsing structural variant calls from all three input callers:

Structural variant calling sensitivity and precision at different event sizes

Across all size classes, we detect approximately half of the structural variants and expect that about half of the called events are false positives. Smaller structural variants of less than 1kb are the most difficult to detect with these methods. Larger events from 1kb to 25kb have better sensitivity and precision. As the size of the events increase precision decreases, so larger called events tend to have more false positives.

Beyond the values for sensitivity and precision, the biggest takeaway is that combining multiple callers helps detect additional variants we’d miss with any individual caller. Count based callers like cn.mops enable improved sensitivity on large deletions but don’t resolve small deletions at 50x depth using our current settings, although tuning can help detect these smaller sized events as well. Similarly, lumpy and delly capture different sets of variants across all of the size classes.

The comparison also emphasizes the potential for improving both individual caller filtering and ensemble structural variant preparation. The ensemble method uses bedtools to create a merged superset of all individually called regions. This is the simplest possible approach to combine calls. Similarly, individual caller filters are intentionally simple as well. cn.mops calling performs no additional filtering beyond the defaults, and could use adjustment to detect and filter smaller events. Our DELLY filter requires 4 supporting reads or both split and paired read evidence. Our LUMPY filter require at least 4 supporting reads to retain an event. We welcome discussion of the costs and tradeoffs of these approaches. For instance, requiring split and paired evidence for DELLY increases precision at the cost of sensitivity. These filters are a useful starting point and resolution, but we hope to continue to refine and improve them over time.

Implementation

bcbio-nextgen handles installation and automation of the programs used in this comparison. The documentation contains instructions to download the data and run the NA12878 trio calling and validation. This input configuration file should be easily adjusted to run on your data of interest.

The current implementation has reasonable run times for whole genome structural variant calling. We use samblaster to perform duplicate marking alongside identification of discordant and split read pairs. The aligned reads from bwa stream directly into samblaster, adding minimal processing time to the run. For LUMPY calling, the pre-prepared split and discordant reads feed directly into speedseq, which nicely automates the process of running LUMPY. For DELLY, we subsample correct pairs in the input BAM to 50 million reads and combine with the pre-extracted problematic pairs to improve runtimes for whole genome inputs.

We processed three concurrently called 50x whole genome samples from FASTQ reads to validated structural variants in approximately 3 days using 32 cores. Following the preparation work described above, LUMPY calling took 6 hours, DELLY takes 24 hours parallelized on 32 cores and cn.mops took 16 hours parallelized by chromosome on 16 cores. This is a single data point for current capabilities, and is an area where we hope to continue to improve scalability and parallelization.

The implementation and validation are fully integrated into the community developed bcbio-nextgen project and we hope to expand this work to incorporate additional structural variant callers like Pindel and CNVkit, as well as improving filtering and ensemble calling. We also want to expand structural variant validation to include tumor/normal cancer samples and targeted sequencing. We welcome contributions and suggestions on current and future directions in structural variant calling.

Written by Brad Chapman

August 12, 2014 at 1:22 pm

17 Responses

Subscribe to comments with RSS.

  1. When you reference structure in this post, are you referring to three-dimensional structure or something else? This is a question I have been trying to answer on my own, so I appreciate this opportunity to check with someone who knows.

    I am not a bioinformatics professional or a geneticist, but I am professionally interested in genomics in the context of its implications for public policy and so I am trying to understand how this works as best I can. I have written a couple of posts about genomics in regards to the influence of structure on genetic expression (http://wp.me/p4RCji-3F and http://wp.me/p4RCji-51) . Would you be willing to read through those posts and clarify any issues you see with my understanding of genomics and bioinformatics?

    Thanks

    • Thanks for the thoughts. This is a different area than chromatin structure, and relates to rearrangments in chromosomal sequence composition and ordering. Sorry the nomenclature is a bit confusing and I should have put a short intro on this in the post. The wikipedia page provides a nice overview: https://en.wikipedia.org/wiki/Structural_variation

      Regarding your more general questions about epigenetics, it is definitely an area of active research. Your posts paint scientists as not thinking epigenetics is important, which I don’t agree with. Many research labs are working in that area; in our support requests, ChIP-seq is as popular as variant calling and RNA-seq. Ultimately we know we’ll need integrative models that combine all these different factors, but right now investigating each individually with a high level of precision and specificity is difficult enough, hence the current research focus. I think most scientists would agree we need to be able to build complex models of compilicated biological interactions at many levels to be able to understand hard questions, and hopefully integrative tools will emerge to let us do that. Thanks again.

      Brad Chapman

      August 15, 2014 at 3:30 pm

      • Thanks for the response. and the clarifications. I just finished summarizing a research paper on autism spectrum disorder and epigenetics (here http://wp.me/p4RCji-5R) which I think refers more to rearrangements of chromosomal sequences you are talking about.

        If my posts came across as scolding scientists in general for not thinking about epigenetics, that was definitely not my intention so I will see how I can express that better. I do see the growing interest in research in epigenetics, but I have also seen a fair amount of skepticism and even hostility against epigenetics as well. In another series of posts (http://wp.me/P4RCji-f) I detail some of the history of epigenetics, and the increasing acceptance of epigenetics observed recently (which I also talk detail here http://wp.me/p4RCji-2i) is not indicative of the attitude towards epigenetics for the last eighty or so years.

        That being said, I appreciate hearing the perspective of someone working inside the business. It definitely adds to my understanding of what is going on.

        To that end, would you be willing to comment on my blog on the post about gene sequence and structure (http://wp.me/p4RCji-3F), or any other post that seems relevant, about your perspective on how genomics and epigenomics could be combined? You mention a little bit about that in your response, but it would be valuable for anyone reading my blog to have an idea about how this would work or could work in a nuts-and-bolts kind of way.

        Thanks again

        • Thanks much. I’m glad that was somewhat helpful and appreciate the clarifications. I’m not the most qualified to comment about unified methods as I don’t do a lot of work in that area but will definitely follow your discussions and chime in. Thanks again.

          Brad Chapman

          August 15, 2014 at 7:38 pm

  2. Dear Brad,

    As usual this post has been very informative. I just wanted to know if you have any knowledge regarding whether I could use these tools to detect structural variants in pooled data. I couldn’t find any specific tools in the case of pooled data and was wondering if the reasons that make some snp/small indel variant callers unfit for pooled data would also hold in the case of the structural variants. Thank you.

    eva

    August 31, 2014 at 5:44 am

    • Eva;
      Structural variant callers are better in calling in pooled or mixed populations since they are often used to detect low frequency events in cancer, which has a mix of subclones and non-even cellularity. Low frequency events are still more difficult to distinguish from false positives, but you’re not dealing with needing to relax assumptions inherent to the callers.

      We’re currently looking at evaluating structural calling in tumor/nomal samples to try and define the limits of detection more precisely, but hope this helps some.

      Brad Chapman

      September 1, 2014 at 12:50 pm

      • Thank you Brad, it does help. Sorry I am asking a lot about pooled data; it is a new territory for me. I appreciate your inputs.

        Eva

        September 2, 2014 at 2:16 am

  3. Hi Brad, great post!

    It would be also intersting to see comparison of the SV-callers for other types of structural variations. In parcticular, I am intersted in intra- and inter-chromosomal translocations. From my experience for this type of abberation the comparison is a little bit trickier. Apart from adjustment of caller’s settings one has to take into account 2 breakpoint positions and possible tolerance in defining them. Also, I am not aware of verifed reliable datasets which could be used for validation.

    Btw, I am working currently on the development of a novel method for detection of fusion genes and chimeric transcripts from RNA-seq data (https://bitbucket.org/kokonech/infusion/wiki/Home). This problem is somewhat similar to detection of SVs from whole genome data: both “split-read” and “discordant pairs” approaches are applicable here. To validate and compare my method to others I have also implemented a pipeline to compare results from different fusion gene callers (https://bitbucket.org/kokonech/infusion/src/master/comparison/). The pipeline is additionally capable of performing ‘ensemble’ approach for detecting overlapping results, which is similar to the one you described. Perhaps this functionality could be of interest to the community and even integrated in bcbio-nextgen in the future (along with an option to perform fusion calling from RNA-seq data). If this is the case, I will be glad to help.

    Konstantin Okonechnikov

    September 2, 2014 at 8:42 am

  4. Are the called variants used to make the plots above available anywhere? I would interested in seeing the data without re-running the callers completely.

    Thanks!

    Arun Ahuja

    October 27, 2014 at 9:16 am

  5. Hi Brad,
    Thank you for such an informative post. I am currently calling Structural variants for tumor/normal pairs (I have bam files mapped by BWA) . unfortunately, the HPC staff here are not willing to install bcbio (and speedseq) pipeline for me since they do not know how not to install the duplicated software bcbin internally uses. They had GATK, BWA etc installed already…Now, I have to use Lumpy just by itself.

    You mentioned that in your post ” The aligned reads from bwa stream directly into samblaster, adding minimal processing time to the run. For LUMPY calling, the pre-prepared split and discordant reads feed directly into speedseq”

    In the lumpy github page https://github.com/arq5x/lumpy-sv, it says to ” extract the reads that are either unmapped or have a soft clipped portion of at least 20 base pairs (by samtools), then use a split-read aligner (yaha) on the unmapped/soft clipped reads”

    My questions is that should I just use samblaster to extract discordant and split-reads or proceed as the lumpy page suggested?

    Thank you so much!

    Ming Tang

    Ming Tang

    April 6, 2015 at 11:59 am

    • Ming;
      Glad the post was useful for you. I believe the yaha workflow on the lumpy page is the older suggested approach and they now recommend using samblaster as we do in bcbio. The documentation was updated a few days ago with this updated recommendation (https://github.com/arq5x/lumpy-sv) as well.

      Regarding installation issues. At least for bcbio (and I think speedseq as well), everything gets installed in an isolated directory. So you will have duplicate versions of bwa and friends in different locations, but can determine what you use based on putting them in your PATH. Many HPC setups use modules for doing this.

      Hope this helps get things running for you.

      Brad Chapman

      April 6, 2015 at 2:30 pm

      • Hi Brad,
        Thanks for this update. I will go with Samblaster.

        Ming

        Ming Tang

        April 6, 2015 at 4:01 pm

      • Hi Brad. It has been a while and I have gotten calls from lumpy. I want to ask the filtering rules. I saw in the post that you require 4 supporting reads. In my calls, I saw some structural variants only have splitting reads (say 9 reads) support. I visualized it in pybamview and indeed 9 reads have 28 bases do not exist in the reference genome. Is it a real SV or do you think at least one discordant read is required to reduced the false positives? Thanks!

        Ming Tang

        December 1, 2015 at 4:04 pm

  6. Did you get a chance to compare this with Genome Strip as well ?

    Najeeb (@najeebashraf)

    February 29, 2016 at 4:00 am

    • Najeeb;
      Sorry we haven’t integrated GenomeSTRiP as part of bcbio for these validations. It would be great to hear experiences from other folks that have worked with it.

      Brad Chapman

      February 29, 2016 at 9:17 am


Leave a comment