Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog

Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions.

Biological question

The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of of majority and minority variants. We are filtering variants on 3 metrics of quality:

Quality score — The sequencing technology’s assessment of the correctness of a base.
K-mer score — An estimate of the uniqueness of the region surrounding the base call position, built using khmer. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.
Mapping score — The aligner’s estimate of the reliability of the read alignment.

Each read and position is in a tab delimited file that looks like:

951     G       31      0.0515130211584 198

The training data has a set of known variable positions, and details about how the current variant calling algorithm did at each position:

951     T       false_positive  0.7
953     A       true_positive   4.0

We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.

Writing cascalog queries

Cascalog is based on the Datalog rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them; it requires a change of mindset from the more standard approach that you need to write detailed instructions about what to do.

Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. This post from Antonio Piccolboni compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.

The main query defines the outputs, retrieves input data from the snpdata and location target files described above, provides a count of reads at each position and base of interest, then averages the kmer, quality and mapping score metrics described earlier:

(defn calc-snpdata-stats [snpdata targets]
  (??<- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct
         ?avg-qual ?avg-map ?type]
        (snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score)
        (targets ?chr ?pos ?base ?type)
        (ops/count ?count)
        (ops/avg ?kmer-pct :> ?avg-kmer-pct)
        (ops/avg ?qual :> ?avg-qual)
        (ops/avg ?map-score :> ?avg-map)
        (combine-score ?kmer-pct ?qual ?map-score :> ?score)
        (ops/avg ?score :> ?avg-score)))

A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the min-max normalized scores:

(defn min-max-norm [score minv maxv]
  (let [trunc-score-max (if (< score maxv) score maxv)
        trunc-score (if (> trunc-score-max minv) trunc-score-max minv)]
    (/ (- trunc-score minv) (- maxv minv))))

(defmapop combine-score [kmer-pct qual map-score]
  (+ (min-max-norm kmer-pct 1e-5 0.10)
     (min-max-norm qual 4.0 35.0)
     (min-max-norm map-score 0.0 250.0)))

The final part of the code involves parsing the files and producing the snpdata and targets inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:

(defmapop parse-snpdata-line [line]
  (let [[space pos base qual kmer-pct map-score] (split line #"\t")]
    [space (Integer/parseInt pos) base (Integer/parseInt qual)
     (Float/parseFloat kmer-pct) (Integer/parseInt map-score)]))

(defn snpdata-from-hfs [dir]
  (let 
    (<- [?chr ?pos ?base ?qual ?kmer-pct ?map-score]
        (source ?line)
        (parse-snpdata-line ?line :> ?chr ?pos ?base ?qual
                                     ?kmer-pct ?map-score))))

Running on Hadoop

The full project is available on GitHub. To run on a configured Hadoop system, you build the code, copy your input files to HDFS, then run:

% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/snp-assess/data
% hadoop fs -mkdir /tmp/snp-assess/positions
% hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data
% hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions
% hadoop jar snp-assess-0.0.1-SNAPSHOT-standalone.jar
             snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions

The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:

% lein deps
% lein run :snp-data /directory/of/varation/data /directory/of/positions

Both approaches generate tabular output with our positions, counts, scores and average metrics:

| 951 | T |  3 | 0.9 | 2.0e-04 | 24.7 | 55.7  | false_positive |
| 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive  |

Overview and additional projects

Cascalog provided an easy to use abstraction on top of Hadoop, which enabled exploration of densely mapped next-generation sequencing reads for variant detection. The code is free of scaling specific details, and instead focuses purely on the data of interest.

Another example of Cascalog in a biological setting is the answer I wrote to Pierre’s question on BioStar, dealing with overlapping genomic segments within Hadoop. The code is available from GitHub as an additional starting point for getting oriented with Hadoop and Cascalog.

Written by Brad Chapman

July 4, 2011 at 9:25 pm

Posted in analysis

Tagged with bioinformatics, cascalog, clojure, hadoop, ngs

3 Responses

Subscribe to comments with RSS.

Brad, very cool. One question re:

“Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.”

How is that? I would have expected the inverse. Maybe I misunderstand what khmer is doing?

brent

July 6, 2011 at 10:09 am

Reply
- Brent;
  The logic was that kmers could be a proxy for the quality of the sequence regions surrounding a potential variant. So if we were looking at a kmer size of 5, had a region like:
```
GGATC
GGATC
GGGTC
GGGTC
GGTTT
```
  and were looking at the middle base, the A and G calls would be more likely than the T call, which might be a bad poly T stretch. The general idea was that one-off or very low frequency regions are more likely to be sequencing errors, while the real biological signal would be present more often. We used 13-mers for the real filtering.
  
  The idea was to push the quality filtering until the end so we could tweak the parameters. The tricky issue with pre-trimming reads is then you have to re-align and re-call for any tweaks so it becomes harder to adjust the filter.
  
  We’re using khmer only to do the kmer counting, so it’s a bit different than Titus’ initial intention. Hope this makes sense, and happy to discuss other ideas.
  Brad Chapman
  
  July 6, 2011 at 12:13 pm
  
  Reply
[…] Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. … […]

Blue Collar Bioinformatics « malariagen informatics

August 2, 2011 at 6:44 am

Reply

Blue Collar Bioinformatics