Finding and displaying short reads clustered in the genome
Short read, or next-generation, sequencing is becoming increasingly prevalent in the day-to-day lives of bioinformatics researchers. Programs such as Bowtie, CloudBurst, Maq, and Arachne are available to align short reads to a reference genome. Once you have these alignments, the data analysis work begins. During a recent project, I was interested in looking at overlapping read groups across a genome; this post describes a method for collecting and visualizing those groups.
Starting with a file from an alignment program, the first step is to write a Python generator that returns information about the alignments. Since there are many different aligners, below is some Python pseudocode which describes the function:
def parse_alignment(align_file, errors_allowed=0): with open(align_file) as align_handle: for (read_id, match_id, match_start, match_end, errors) in \ your_parser.read_iterator(align_handle): if (errors <= errors_allowed): yield read_id, match_id, match_start, match_end
It parses a line, checks if it has an acceptable number of alignment errors, and yields a unique id for the read, an id for the match like a chromosome or contig name, and the start and end coordinates of the match. We can use this generator to build our representation of overlapping reads with the help of the excellent bx-python library. bx-python contains a Python and C implementation of clustered interval trees. The function below uses the interface to generate
ClusterTree objects for each chromosome/contig in your alignment:
import collections from bx.intervals.cluster import ClusterTree def build_cluster_trees(align_generator, cluster_distance): # arguments to ClusterTree are: # - Distance in basepairs for two reads to be in the same cluster; # for instance 20 would group all reads with 20bp of each other # - Number of reads necessary for a group to be considered a cluster; # 2 returns all groups with 2 or more overlapping reads cluster_trees = collections.defaultdict(lambda: ClusterTree(cluster_distance, 2)) for read_id, match_id, start, end in align_generator: cluster_trees[match_id].insert(start, end, read_id) return dict(cluster_trees)
The generated trees will readily provide a list of start and end coordinates for a clustered region, along with all of the reads that map to that region:
for chrom, cluster_tree in cluster_trees.items(): for start, end, read_ids in cluster_tree.getregions(): # do something with your read cluster
My interest was in visualizing the reads in these clusters along with the frequency of each read. To do so, I generated two python dictionaries which map the
read_id identifiers, used in the grouping, to the chromosome location of the read (
location_db) and the number of times a read was found in the raw short read results (
freq_db). Practically, these are implemented as BerkeleyDB key value stores, to avoid having to keep all of the read information in memory.
import pylab def plot_reads(cluster_name, read_ids, location_db, freq_db): read_info =  for read_id in read_ids: start, end = location_db[read_id] freq = freq_db[read_id] read_info.append((start, end, freq)) read_info.sort(reverse=True) min_freq = min(l for l in read_info) max_freq = max(l for l in read_info) freq_cmap = pylab.get_cmap('Blues') pylab.clf() for rindex, (start, end, freq) in enumerate(read_info): freq_percent = float(freq - min_freq) / float(max_freq - min_freq) color = freq_cmap(freq_percent) if freq in [min_freq, max_freq]: label = "Frequency: %s" % (freq) else: label = None if label and label not in labels_done: labels_done.append(label) else: label = None pylab.barh(rindex, end - start, left=start, height=0.8, color=color, label=label) pylab.title(cluster_name) pylab.xlabel("Coordinates") pylab.ylabel("Reads") pylab.legend(loc='upper right') pylab.yticks(()) out_file = "%s.png" % (cluster_name) pylab.savefig(out_file)
The resulting graph for an example region provides a quick visual summary of read coverage, overlap, and reliability: