Sorting genomic alignments using Python
Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.
The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.
We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.
The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.
from bx.align import maf from bx import interval_index_file def build_index(in_file, index_file): indexes = interval_index_file.Indexes() with open(in_file) as in_handle: reader = maf.Reader(in_handle) while 1: pos = reader.file.tell() rec = reader.next() if rec is None: break for c in rec.components: indexes.add(c.src, c.forward_strand_start, c.forward_strand_end, pos, max=c.src_size ) with open(index_file, "w") as index_handle: indexes.write(index_handle)
Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.
rec_info =  with open(in_file) as in_handle: reader = maf.Reader(in_handle) while 1: pos = reader.file.tell() rec = reader.next() if rec is None: break rec_info.append((rec.text_size, pos)) rec_info.sort(reverse=True)
Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.
index = maf.Indexed(in_file, index_file) with open(out_file, "w") as out_handle: writer = maf.Writer(out_handle) for size, pos in rec_info: rec = index.get_at_offset(pos) writer.write(rec)