Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘visualization

Genomics X Prize public phase update: variant classification and de novo calling

with 7 comments


Last month I described our work at HSPH and EdgeBio preparing reference genomes for the Archon Genomics X Prize public phase, detailing methods used in the first version of our NA19239 variant calls. We’ve been steadily improving the calling approaches, and released version 0.2 on the X Prize validation website and GenomeSpace. Here I’ll describe the improvements we’ve made over the last month, focusing on two specific areas:

  • De novo calling: Zam Iqbal suggested using his cortex_var de novo variant caller in addition to the current GATK, FreeBayes and samtools callers. With his help, we’ve included these calls in this release, and provide comparisons between de novo and alignment based methods.
  • Improved variant classification: Consolidating variant calls from multiple callers involves making tough choices about when to include or exclude variants. I’ll describe the details of selecting metrics for use in SVM classification and filtering of variants.

Our goal is to iteratively improve our calling and variant preparation to create the best possible set of reference calls. I’d be happy to talk more with anyone working on similar problems or with insight into useful ways to improve our current callsets. We have a Get Satisfaction site for discussion and feedback, and have offered a $2500 prize for helpful comments.

As a reminder, all of the code and data used here is freely available:

  • The variant analysis infrastructure, built on top of GATK, automates genome preparation, normalization and comparison. It provides a full pipeline, driven by simple configuration files, for consolidating multiple variant calls.
  • The combined variant calls, including training data and potential true and false positives, are available from GenomeSpace: Public/chapmanb/xprize/NA19239-v0_2.
  • The individual variant calls for each technology and calling method are also available from GenomeSpace: Public/EdgeBio/PublicData/Release1.

de novo variant calling with cortex_var

de novo variant calling performs reference-free assembly of either local or global genome regions, then subsequently uses these assemblies to call variants relative to a known reference. The advantage is that assemblies can avoid errors associated with mapping to the reference, helping resolve large variations as well as small variations near problem alignments or low complexity regions.

Hybrid approaches that use localized de novo assembly in variant regions help mitigate the extensive computational requirements associated with whole-genome assembly. Complete Genomics variant calling and GATK 2.0’s Haplotype Caller both provide pipelines for hybrid de novo assembly in variant detection. The fermi and SGA assemblers are also used in variant calling, although the paths from assembly to variants are not as automated.

Thanks to Zam’s generous assistance, we used cortex_var for localized de novo assembly and variant calling within individual fosmid boundaries. As a result, CloudBioLinux now contains automated build instructions for cortex_var , handling binary builds for multiple k-mer and color combinations. An automated cortex_var pipeline, part of the bcbio-nextgen toolkit, runs the processing workflow:

  • Start with reads aligned to fosmid regions using novoalign.
  • For each fosmid region, extract all mapped reads along with a local reference genome for variant calling.
  • de novo assemble all reads in the fosmid region and call variants against the local reference genome using cortex_var’s Bubble Caller.
  • Remap regional variant coordinates back to the full genome.
  • Combine all regional calls into the final set of cortex_var calls.

Directly comparing GATK and cortex_var calls shows similar levels of concordance and discordance as the GATK/samtools comparison from the last post:

concordant: total 153787
concordant: SNPs 130913
concordant: indels 22874
GATK discordant: total 20495
GATK discordant: SNPs 6522
GATK discordant: indels 13973
cortex_var discordant: total 26790
cortex_var discordant: SNPs 21342
cortex_var discordant: indels 5448

11% of the GATK calls and 14% of the cortex_var calls are discordant. The one area where cortex_var does especially well is on indels: 19% of the cortex_var indels do not agree with GATK, in comparison with 37% of the GATK calls and 25% of the samtools calls. The current downside to this is SNP calling, where cortex_var has 3 times more discordant calls than GATK.

Selection of classification metrics

Overlapping variant calls from different calling methods (GATK, FreeBayes, samtools and cortex_var) and sequencing technologies (Illumina, SOLiD and IonTorrent) produces 170,286 potential calls in the fosmid regions. 58% (99,227) of these are present in all callers and technologies, so we need to do better than the intersection in creating a consolidated callset.

As detailed in the previous post, we filter the full set in two ways. The first is to keep trusted variants based on their presence in a defined number of technologies or callers. We currently have an inclusive set of criteria: keep variants present in either 4 out of the 7 callsets, 2 distinct technologies, or 3 distinct callers. This creates a trusted set containing 95% (162,202) of the variants. Longer term the goal is to reduce the trusted count and rely on automated filtering approaches based on input metrics.

This second automated filtering step uses a support vector machine (SVM) to evaluate the remaining variants. We train the SVM on potential true positives from variants that overlap in all callers and technologies, and potential false positives found uniquely in one single caller. The challenge is to find useful metrics associated with these training variants that will help provide discrimination.

In version 0.1 we filtered with a vanilla set of metrics: depth and variant quality score. To identify additional metrics, we used a great variant visualization tool developed by Keming Labs alongside HSPH and EdgeBio. I’ll write up more details about the tool once we have a demonstration website but the code is already available on GitHub.

To remove variants preferentially associated with poorly mapping or misaligned reads, we identified two useful metrics. ReadPosEndDist, written as a GATK annotation by Justin Zook at NIST, identifies variants primarily supported by calls at the ends of reads. Based on visual examination, these associate with difficult to map regions, as identified by Genome Mappability Scores:

Secondly, we identified problematic allele balances that differ from the expected ratios. For haploid fosmid calls, we expect 100% of reads to support variants and 0% to support reference calls (in diploid calls, you also need to handle heterozygotes with 50% expected allele balance). In practice, the distribution of reads can differ due to sequencer and alignment errors. We use a metric that measures deviation from the expected allele balance and associates closely with variant likelihoods:

Improved consolidated calls

To assess the influence of adding de novo calls and additional filtering metrics on the resulting call set, we compare against whole genome Illumina and Complete Genomics calls for NA19239. Previously we’d noticed two major issues during this comparison: a high percentage of discordant indel calls and a technology bias signaled by better concordance with Illumina than Complete.

The comparison between fosmid and Illumina data shows a substantial improvement in the indel bias. Previously 46% of the total indel calls were discordant, indicative of a potential false positive problem. With de novo calls and improved filtering, we’ve lowered this to only 10% of the total calls.

concordant: total 147684
concordant: SNPs 133861
concordant: indels 13823
fosmid discordant: total 7519
fosmid discordant: SNPs 5856
fosmid discordant: indels 1663
Illumina discordant: total 5640
Illumina discordant: SNPs 1843
Illumina discordant: indels 3797

This improvement comes with a decrease in the total number of concordant indel calls, since we moved from 17,816 calls to 13,823. However a large number of these seemed to be Illumina specific since 60% of the previous calls were discordant when compared with Complete Genomics. The new callset reduces this discrepancy and only 22% of the indels are now discordant:

concordant: total 139155
concordant: SNPs 127243
concordant: indels 11912
fosmid discordant: total 15484
fosmid discordant: SNPs 12028
fosmid discordant: indels 3456
Complete Genomics discordant: total 7311
Complete Genomics discordant: SNPs 4972
Complete Genomics discordant: indels 2273

These comparisons provide some nice confirmation that we’re moving in the right direction on filtering. As before, we extract potential false positives and false negatives to continue to refine the calls: potential false positives are those called in the fosmid dataset and in neither of the Illumina or Complete Genomics sets. Potential false negatives are calls that both Illumina and Complete agree on that the fosmid calls lack.

In the new callsets, there are 5,499 (3.5%) potential false positives and 1,422 (0.9%) potential false negatives. We’ve reduced potential false positives in the previous set from 10% with a slight increase in false negatives. These subsets are available along with the full callset on GenomeSpace. We’re also working hard on an NA12878 callset with equivalent approaches and will make that available soon for community feedback.

I hope this discussion, open source code, and dataset release is useful to everyone working on problems of improving variant calling accuracy and filtering. I welcome feedback on calling, consolidation methods, interesting metrics to explore, machine learning or any of the other topics discussed here.

Written by Brad Chapman

September 17, 2012 at 8:41 am

Talking at BOSC 2009 about publishing biological data on the web

with 3 comments

The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years, and giving a short and sweet 10 minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with The Data and Analysis Management special interest group, also full of interesting talks.

The talk will promote development of reusable web based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:

  • Motivation: rapidly organize and display biological data in a web accessible format.
  • Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
  • Current state: back end databases for storing biological data — BioSQL, GMOD
  • Current state: full featured web applications targeted at users — Galaxy, GBrowse
  • My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with larger community.
  • Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
  • Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
  • Framework: make use of front end javascript frameworks — jQuery, ExtJS.
  • Framework: make use of back end web frameworks — Pylons
  • Implementation: Demo server for displaying sequences plus annotations
  • Implementation: Utilizes BioSQL schema, ported to object oriented data store; Google App engine backend or MongoDB backend
  • Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
  • Implementation: Additional screenshots from internal web displays.
  • Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
  • Challenges: Re-usable components for cross-language functionality; javascript front end displays for multi-language back ends.
  • Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.

I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.

Written by Brad Chapman

June 11, 2009 at 7:41 am

Finding and displaying short reads clustered in the genome

with one comment

Short read, or next-generation, sequencing is becoming increasingly prevalent in the day-to-day lives of bioinformatics researchers. Programs such as Bowtie, CloudBurst, Maq, and Arachne are available to align short reads to a reference genome. Once you have these alignments, the data analysis work begins. During a recent project, I was interested in looking at overlapping read groups across a genome; this post describes a method for collecting and visualizing those groups.

Starting with a file from an alignment program, the first step is to write a Python generator that returns information about the alignments. Since there are many different aligners, below is some Python pseudocode which describes the function:

def parse_alignment(align_file, errors_allowed=0):
    with open(align_file) as align_handle:
        for (read_id, match_id, match_start, match_end, errors) in \
            if (errors <= errors_allowed):
                yield read_id, match_id, match_start, match_end

It parses a line, checks if it has an acceptable number of alignment errors, and yields a unique id for the read, an id for the match like a chromosome or contig name, and the start and end coordinates of the match. We can use this generator to build our representation of overlapping reads with the help of the excellent bx-python library. bx-python contains a Python and C implementation of clustered interval trees. The function below uses the interface to generate ClusterTree objects for each chromosome/contig in your alignment:

import collections
from bx.intervals.cluster import ClusterTree

def build_cluster_trees(align_generator, cluster_distance):
    # arguments to ClusterTree are:
    # - Distance in basepairs for two reads to be in the same cluster;
    #   for instance 20 would group all reads with 20bp of each other
    # - Number of reads necessary for a group to be considered a cluster;
    #   2 returns all groups with 2 or more overlapping reads
    cluster_trees = collections.defaultdict(lambda:
            ClusterTree(cluster_distance, 2))
    for read_id, match_id, start, end in align_generator:
        cluster_trees[match_id].insert(start, end, read_id)
    return dict(cluster_trees)

The generated trees will readily provide a list of start and end coordinates for a clustered region, along with all of the reads that map to that region:

for chrom, cluster_tree in cluster_trees.items():
    for start, end, read_ids in cluster_tree.getregions():
        # do something with your read cluster

My interest was in visualizing the reads in these clusters along with the frequency of each read. To do so, I generated two python dictionaries which map the read_id identifiers, used in the grouping, to the chromosome location of the read (location_db) and the number of times a read was found in the raw short read results (freq_db). Practically, these are implemented as BerkeleyDB key value stores, to avoid having to keep all of the read information in memory.

With this information, we can use matplotlib to visualize the reads. Frequencies are represented using a color map. Each read in a group is plotted using horizontal bar graphs:

import pylab

def plot_reads(cluster_name, read_ids, location_db, freq_db):
    read_info = []
    for read_id in read_ids:
        start, end = location_db[read_id]
        freq = freq_db[read_id]
        read_info.append((start, end, freq))
    min_freq = min(l[2] for l in read_info)
    max_freq = max(l[2] for l in read_info)
    freq_cmap = pylab.get_cmap('Blues')
    for rindex, (start, end, freq) in enumerate(read_info):
        freq_percent = float(freq - min_freq) / float(max_freq - min_freq)
        color = freq_cmap(freq_percent)
        if freq in [min_freq, max_freq]:
            label = "Frequency: %s" % (freq)
            label = None
        if label and label not in labels_done:
            label = None
        pylab.barh(rindex, end - start, left=start, height=0.8, color=color,
    pylab.legend(loc='upper right')
    out_file = "%s.png" % (cluster_name)

The resulting graph for an example region provides a quick visual summary of read coverage, overlap, and reliability:

Example coverage plot

Written by Brad Chapman

April 29, 2009 at 4:55 pm

Automated protein conservation display from BLAST alignments

with 16 comments

Pawel at Freelancing Science had an excellent post about making structural predictions from visual analysis of BLAST alignments. This concept inspired me to put together an automated solution to visualize protein conservation. Starting with a protein of interest it will retrieve the relevant identifier from NCBI, do a remote BLAST, examine the alignments to related divergent proteins, calculate conservation and display a plot of the results.

With a protein identifier like a GenBank accession number or UniProt ID, we first need to find the standard NCBI GI number. Using Biopython’s Entrez module:

def search_for_gi(self, uniprot_id, db_name):
    handle = Entrez.esearch(db=db_name, term=uniprot_id)
    record =
    ids = record["IdList"]
    if len(ids) == 0:
        raise ValueError("Not found in NCBI: %s" % ids)
    return ids[0]

Given the GI, a remote BLAST is performed and the XML result is parsed into a record object. This is again done using Biopython libraries, with the BLAST result cached in case of re-runs. If you were using this code for many queries or on a web page presented to scientists, it would make sense to use a local BLAST for speed purposes. This could easily be plugged in, again using Biopython. Here is the remote version:

def remote_blast(self, search_gi, blast_method):
    out_file = os.path.join(self._cache_dir, "%s_%s_blo.xml" % (blast_method,
    if not os.path.exists(out_file):
        blast_handle = NCBIWWW.qblast(blast_method, "nr", search_gi)
        with open(out_file, 'w') as out_handle:
            for line in blast_handle:
    with open(out_file) as in_handle:
        rec_it = NCBIXML.parse(in_handle)

With the parsed record, the next step is to loop over the alignments to calculate conservation across the protein. To provide quantitative results, a protein substitution matrix provides a score for each BLAST alignment character pair. Higher scores indicate a more conserved alignment, with exact matches being the highest scores. We use the BLOSUM62 matrix here, although a wide variety are supported by Biopython. The class below loops through all of the alignments and high scoring pairs (HSP, in BLAST nomenclature), notes the position, and uses the alignment pairs and matrix to assign conservation scores at each position:

class BlastConservationCalculator:
    def __init__(self, matrix_name="blosum62"):
        self._subs_mat = getattr(MatrixInfo, matrix_name)
        self._no_use_thresh = 0.95

    def conservation_dict(self, blast_rec):
        cons_dict = {}
        rec_size = int(blast_rec.query_letters)
        for base_index in range(rec_size):
            cons_dict[base_index] = []
        for align in blast_rec.alignments:
            for hsp in align.hsps:
                if (float(hsp.identities) / float(rec_size) <=
                    cons_dict = self._add_hsp_conservation(hsp, cons_dict)
        return cons_dict

    def _add_hsp_conservation(self, hsp, cons_dict):
        start_index = int(hsp.query_start) - 1
        hsp_index = 0
        for q_index in range(len(hsp.query)):
            if (hsp.query&#91;q_index&#93; != '-'):
                if (hsp.sbjct&#91;q_index&#93; != '-'):
                        sub_val = self._subs_mat&#91;(hsp.query&#91;q_index&#93;,
                    except KeyError:
                        sub_val = self._subs_mat&#91;(hsp.sbjct&#91;q_index&#93;,
                    cons_dict&#91;start_index + hsp_index&#93;.append(sub_val)
                hsp_index += 1
        return cons_dict

The result of this work is a dictionary of score conservation at each position. If you plot the average of these scores directly, it results in a very choppy graph which is hard to interpret. Luckily, Andrew Dalke has tackled this problem and presented a detailed writeup of <a href="">smoothing scores for plotting</a>. Using the Savitzky-Golay technique described there, the smoothed average of the results are plotted using <a href="">matplotlib</a>:

window_size = 29
data_smoother = SavitzkyGolayDataSmoother(window_size)
pos_data = []
cons_data = []
for pos in indexes:
    pos_data.append(pos + 1)
    if len(cons_dict[pos]) > 0:
smooth_data = data_smoother.smooth_values(cons_data)
smooth_pos_data = pos_data[data_smoother.half_window():
        len(pos_data) - data_smoother.half_window()]
pylab.plot(smooth_pos_data, smooth_data)
pylab.axis(xmin=min(pos_data), xmax=max(pos_data))
pylab.xlabel("Amino acid position")
pylab.savefig('%s_conservation.png' % accession.replace(".", "_"))

The resulting plot was prepared for the example from Pawel’s post that inspired all this and is shown below. We can see the 4 regions of less conservation noted by Pawel by inspection of the alignment, along with the 3 intervening peaks of conservation:

Example conservation plot

The full script puts all of these parts together into a working version that could be used for your own proteins of interest. These plots are useful for getting a quick overview of protein conservation when working with a new protein. They could be used to compare with known regions like domains, to identify likely regions of importance for protein deletion studies, or to make structure evaluations. The intriguing aspect of the plots is the opportunity to quickly evaluate and make predictions for experimental study.

Written by Brad Chapman

February 7, 2009 at 5:18 pm

Graphing of variables before classification

with one comment

It is important to be able to graphically evaluate variables that feed into classification algorithms. In addition to assessing the distribution of the data, visual inspection also reveals patterns that can be later tested statistically. This post describes the preparation of a quick histogram for data from positive and negative examples feeding into a classifier. The excellent matplotlib library is used for visualization.

The example involves classifying two sets of proteins based on regional sequence charge. Two FASTA files were prepared, containing positive (active) and negative (non-active) examples. The goal is to look for a difference in charge between the two groups. Given a window size of amino acids to look at, we loop over the records in the file using the Biopython SeqIO interface:

def file_charges(in_file, cur_window):
    all_charges = []
    with open(in_file) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
            cur_charges = calc_region_charges(rec.seq, cur_window)
    return all_charges

We also use Biopython to calculate the Isoelectric point of each protein window. This is used as a proxy for charge; higher isoelectric points correspond to higher charges over the region.

def calc_region_charges(seq, cur_window):
    # internal small regions, so do not deal with C and N terminal charges
    IsoelectricPoint.pKcterminal = {}
    IsoelectricPoint.pKnterminal = {}
    cur_pos = 0
    region_charges = []
    while cur_pos < len(seq) - cur_window:
        cur_seq = seq&#91;cur_pos:cur_pos + cur_window&#93;
        prot_analysis = ProtParam.ProteinAnalysis(str(cur_seq))
        ie_calc = IsoelectricPoint.IsoelectricPoint(cur_seq,
        cur_pos += 1
    return region_charges

With this in place, we get the charges for the example records and plot them side by side as a histogram using matplotlib:

cur_window = 75
pos_charges = file_charges(pos_file, cur_window)
neg_charges = file_charges(neg_file, cur_window)
n, bins, patches = pylab.hist([pos_charges, neg_charges], 30,
        normed=True, histtype='bar')
pylab.xlabel('Isoelectric point')
pylab.ylabel('Normalized percent of regions')
pylab.title('Protein charge of %s amino acid windows' % cur_window)
pylab.legend([patches[0][0], patches[1][0]], ['positives', 'negatives'])

The resulting graph shows the different distribution of charge between the positive and negative records. At an isoelectric point just above 10, only sequence windows from the positive examples are found. Looking at the percentage of highly charged sequence regions above this threshold in non-classified sequences could serve as the basis for a classifier.

Example histogram

Example histogram

The example script puts all of this together, and could be used as a template for classification problems in your own research.

Written by Brad Chapman

January 24, 2009 at 1:26 pm