Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog
Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions.
The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of of majority and minority variants. We are filtering variants on 3 metrics of quality:
- Quality score — The sequencing technology’s assessment of the correctness of a base.
- K-mer score — An estimate of the uniqueness of the region surrounding the base call position, built using khmer. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.
- Mapping score — The aligner’s estimate of the reliability of the read alignment.
Each read and position is in a tab delimited file that looks like:
951 G 31 0.0515130211584 198
The training data has a set of known variable positions, and details about how the current variant calling algorithm did at each position:
951 T false_positive 0.7 953 A true_positive 4.0
We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.
Writing cascalog queries
Cascalog is based on the Datalog rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them; it requires a change of mindset from the more standard approach that you need to write detailed instructions about what to do.
Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. This post from Antonio Piccolboni compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.
The main query defines the outputs, retrieves input data from the
snpdata and location
target files described above, provides a count of reads at each position and base of interest, then averages the kmer, quality and mapping score metrics described earlier:
(defn calc-snpdata-stats [snpdata targets] (??<- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct ?avg-qual ?avg-map ?type] (snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score) (targets ?chr ?pos ?base ?type) (ops/count ?count) (ops/avg ?kmer-pct :> ?avg-kmer-pct) (ops/avg ?qual :> ?avg-qual) (ops/avg ?map-score :> ?avg-map) (combine-score ?kmer-pct ?qual ?map-score :> ?score) (ops/avg ?score :> ?avg-score)))
A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the min-max normalized scores:
(defn min-max-norm [score minv maxv] (let [trunc-score-max (if (< score maxv) score maxv) trunc-score (if (> trunc-score-max minv) trunc-score-max minv)] (/ (- trunc-score minv) (- maxv minv)))) (defmapop combine-score [kmer-pct qual map-score] (+ (min-max-norm kmer-pct 1e-5 0.10) (min-max-norm qual 4.0 35.0) (min-max-norm map-score 0.0 250.0)))
The final part of the code involves parsing the files and producing the
targets inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:
(defmapop parse-snpdata-line [line] (let [[space pos base qual kmer-pct map-score] (split line #"\t")] [space (Integer/parseInt pos) base (Integer/parseInt qual) (Float/parseFloat kmer-pct) (Integer/parseInt map-score)])) (defn snpdata-from-hfs [dir] (let 1 (<- [?chr ?pos ?base ?qual ?kmer-pct ?map-score] (source ?line) (parse-snpdata-line ?line :> ?chr ?pos ?base ?qual ?kmer-pct ?map-score))))
Running on Hadoop
% lein deps % lein uberjar % hadoop fs -mkdir /tmp/snp-assess/data % hadoop fs -mkdir /tmp/snp-assess/positions % hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data % hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions % hadoop jar snp-assess-0.0.1-SNAPSHOT-standalone.jar snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions
The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:
% lein deps % lein run :snp-data /directory/of/varation/data /directory/of/positions
Both approaches generate tabular output with our positions, counts, scores and average metrics:
| 951 | T | 3 | 0.9 | 2.0e-04 | 24.7 | 55.7 | false_positive | | 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive |
Overview and additional projects
Cascalog provided an easy to use abstraction on top of Hadoop, which enabled exploration of densely mapped next-generation sequencing reads for variant detection. The code is free of scaling specific details, and instead focuses purely on the data of interest.
Another example of Cascalog in a biological setting is the answer I wrote to Pierre’s question on BioStar, dealing with overlapping genomic segments within Hadoop. The code is available from GitHub as an additional starting point for getting oriented with Hadoop and Cascalog.