Finding and displaying short reads clustered in the genome

Short read, or next-generation, sequencing is becoming increasingly prevalent in the day-to-day lives of bioinformatics researchers. Programs such as Bowtie, CloudBurst, Maq, and Arachne are available to align short reads to a reference genome. Once you have these alignments, the data analysis work begins. During a recent project, I was interested in looking at overlapping read groups across a genome; this post describes a method for collecting and visualizing those groups.

Starting with a file from an alignment program, the first step is to write a Python generator that returns information about the alignments. Since there are many different aligners, below is some Python pseudocode which describes the function:

def parse_alignment(align_file, errors_allowed=0):
    with open(align_file) as align_handle:
        for (read_id, match_id, match_start, match_end, errors) in \
                your_parser.read_iterator(align_handle):
            if (errors <= errors_allowed):
                yield read_id, match_id, match_start, match_end

It parses a line, checks if it has an acceptable number of alignment errors, and yields a unique id for the read, an id for the match like a chromosome or contig name, and the start and end coordinates of the match. We can use this generator to build our representation of overlapping reads with the help of the excellent bx-python library. bx-python contains a Python and C implementation of clustered interval trees. The function below uses the interface to generate ClusterTree objects for each chromosome/contig in your alignment:

import collections
from bx.intervals.cluster import ClusterTree

def build_cluster_trees(align_generator, cluster_distance):
    # arguments to ClusterTree are:
    # - Distance in basepairs for two reads to be in the same cluster;
    #   for instance 20 would group all reads with 20bp of each other
    # - Number of reads necessary for a group to be considered a cluster;
    #   2 returns all groups with 2 or more overlapping reads
    cluster_trees = collections.defaultdict(lambda:
            ClusterTree(cluster_distance, 2))
    for read_id, match_id, start, end in align_generator:
        cluster_trees[match_id].insert(start, end, read_id)
    return dict(cluster_trees)

The generated trees will readily provide a list of start and end coordinates for a clustered region, along with all of the reads that map to that region:

for chrom, cluster_tree in cluster_trees.items():
    for start, end, read_ids in cluster_tree.getregions():
        # do something with your read cluster

My interest was in visualizing the reads in these clusters along with the frequency of each read. To do so, I generated two python dictionaries which map the read_id identifiers, used in the grouping, to the chromosome location of the read (location_db) and the number of times a read was found in the raw short read results (freq_db). Practically, these are implemented as BerkeleyDB key value stores, to avoid having to keep all of the read information in memory.

With this information, we can use matplotlib to visualize the reads. Frequencies are represented using a color map. Each read in a group is plotted using horizontal bar graphs:

import pylab

def plot_reads(cluster_name, read_ids, location_db, freq_db):
    read_info = []
    for read_id in read_ids:
        start, end = location_db[read_id]
        freq = freq_db[read_id]
        read_info.append((start, end, freq))
    read_info.sort(reverse=True)
    min_freq = min(l[2] for l in read_info)
    max_freq = max(l[2] for l in read_info)
    freq_cmap = pylab.get_cmap('Blues')
    pylab.clf()
    for rindex, (start, end, freq) in enumerate(read_info):
        freq_percent = float(freq - min_freq) / float(max_freq - min_freq)
        color = freq_cmap(freq_percent)
        if freq in [min_freq, max_freq]:
            label = "Frequency: %s" % (freq)
        else:
            label = None
        if label and label not in labels_done:
            labels_done.append(label)
        else:
            label = None
        pylab.barh(rindex, end - start, left=start, height=0.8, color=color,
                label=label)
    pylab.title(cluster_name)
    pylab.xlabel("Coordinates")
    pylab.ylabel("Reads")
    pylab.legend(loc='upper right')
    pylab.yticks(())
    out_file = "%s.png" % (cluster_name)
    pylab.savefig(out_file)

The resulting graph for an example region provides a quick visual summary of read coverage, overlap, and reliability:

Example coverage plot

Written by Brad Chapman

April 29, 2009 at 4:55 pm

Posted in visualization

Tagged with bioinformatics, matplotlib, ngs, python, visualization

One Response

Subscribe to comments with RSS.

[…] example 1, […]

Python learning resources | Wei Shen's Note

August 4, 2015 at 9:18 pm

Reply

Blue Collar Bioinformatics