Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘python

Trimming adaptors from short read sequences

with 11 comments

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    if right_side:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Tagged with , , ,

Sorting genomic alignments using Python

leave a comment »

Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.

The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.

We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.

The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.

from bx.align import maf
from bx import interval_index_file

def build_index(in_file, index_file):
    indexes = interval_index_file.Indexes()
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec =
            if rec is None:
            for c in rec.components:
                indexes.add(c.src, c.forward_strand_start,
                        c.forward_strand_end, pos, max=c.src_size )

    with open(index_file, "w") as index_handle:

Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.

rec_info = []
with open(in_file) as in_handle:
    reader = maf.Reader(in_handle)
    while 1:
        pos = reader.file.tell()
        rec =
        if rec is None:
        rec_info.append((rec.text_size, pos))

Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.

index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
    writer = maf.Writer(out_handle)
    for size, pos in rec_info:
        rec = index.get_at_offset(pos)

The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.

Written by Brad Chapman

July 26, 2009 at 3:52 pm

Posted in OpenBio

Tagged with , ,

Thoughts from BOSC 2009; Python in bioinformatics

with 6 comments

I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This lead to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.

Python in Bioinformatics

One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.

This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.

What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each others platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curves to producing useful biology related code in Python.


My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first idea was taken very successfully by the Firefox browser with their plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.

A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.

The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons like backend, SQLAlchemy for databases interactivity, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.

We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.

I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.

Written by Brad Chapman

July 19, 2009 at 8:13 pm

Posted in OpenBio

Tagged with , ,

Finding and displaying short reads clustered in the genome

with one comment

Short read, or next-generation, sequencing is becoming increasingly prevalent in the day-to-day lives of bioinformatics researchers. Programs such as Bowtie, CloudBurst, Maq, and Arachne are available to align short reads to a reference genome. Once you have these alignments, the data analysis work begins. During a recent project, I was interested in looking at overlapping read groups across a genome; this post describes a method for collecting and visualizing those groups.

Starting with a file from an alignment program, the first step is to write a Python generator that returns information about the alignments. Since there are many different aligners, below is some Python pseudocode which describes the function:

def parse_alignment(align_file, errors_allowed=0):
    with open(align_file) as align_handle:
        for (read_id, match_id, match_start, match_end, errors) in \
            if (errors <= errors_allowed):
                yield read_id, match_id, match_start, match_end

It parses a line, checks if it has an acceptable number of alignment errors, and yields a unique id for the read, an id for the match like a chromosome or contig name, and the start and end coordinates of the match. We can use this generator to build our representation of overlapping reads with the help of the excellent bx-python library. bx-python contains a Python and C implementation of clustered interval trees. The function below uses the interface to generate ClusterTree objects for each chromosome/contig in your alignment:

import collections
from bx.intervals.cluster import ClusterTree

def build_cluster_trees(align_generator, cluster_distance):
    # arguments to ClusterTree are:
    # - Distance in basepairs for two reads to be in the same cluster;
    #   for instance 20 would group all reads with 20bp of each other
    # - Number of reads necessary for a group to be considered a cluster;
    #   2 returns all groups with 2 or more overlapping reads
    cluster_trees = collections.defaultdict(lambda:
            ClusterTree(cluster_distance, 2))
    for read_id, match_id, start, end in align_generator:
        cluster_trees[match_id].insert(start, end, read_id)
    return dict(cluster_trees)

The generated trees will readily provide a list of start and end coordinates for a clustered region, along with all of the reads that map to that region:

for chrom, cluster_tree in cluster_trees.items():
    for start, end, read_ids in cluster_tree.getregions():
        # do something with your read cluster

My interest was in visualizing the reads in these clusters along with the frequency of each read. To do so, I generated two python dictionaries which map the read_id identifiers, used in the grouping, to the chromosome location of the read (location_db) and the number of times a read was found in the raw short read results (freq_db). Practically, these are implemented as BerkeleyDB key value stores, to avoid having to keep all of the read information in memory.

With this information, we can use matplotlib to visualize the reads. Frequencies are represented using a color map. Each read in a group is plotted using horizontal bar graphs:

import pylab

def plot_reads(cluster_name, read_ids, location_db, freq_db):
    read_info = []
    for read_id in read_ids:
        start, end = location_db[read_id]
        freq = freq_db[read_id]
        read_info.append((start, end, freq))
    min_freq = min(l[2] for l in read_info)
    max_freq = max(l[2] for l in read_info)
    freq_cmap = pylab.get_cmap('Blues')
    for rindex, (start, end, freq) in enumerate(read_info):
        freq_percent = float(freq - min_freq) / float(max_freq - min_freq)
        color = freq_cmap(freq_percent)
        if freq in [min_freq, max_freq]:
            label = "Frequency: %s" % (freq)
            label = None
        if label and label not in labels_done:
            label = None
        pylab.barh(rindex, end - start, left=start, height=0.8, color=color,
    pylab.legend(loc='upper right')
    out_file = "%s.png" % (cluster_name)

The resulting graph for an example region provides a quick visual summary of read coverage, overlap, and reliability:

Example coverage plot

Written by Brad Chapman

April 29, 2009 at 4:55 pm

Initial GFF parser for Biopython

with 10 comments

Generic feature format (GFF) is a nice plain text file format for storing annotations on biological sequences, and would be very useful tied in with the BioSQL relational database. Two weeks ago, I detailed the Bioperl GenBank to GFF mapping, which provided some introductory background to the problem. Here I’m continuing to explore GFF and BioSQL together, but from the opposite direction.

I implemented an initial pass at a python GFF parser that will hopefully eventually be included in Biopython. You can find the current code in the git repository. I’d be very happy to have others help with development, provide usage feedback and pass along difficult GFF files for testing.


GFF parsing is a little tricker to fit into the Biopython SeqIO system than other sequence file formats. Formats like GenBank or UniProt contain a sequence and its features combined into a single record. This allows parsers to iterate over files one at a time, returning generic sequence objects from each record. This scales with large files so your memory and processor worries are bounded by the most complicated record in the file.

In contrast, GFF files are separate from the primary sequences and do not have any guarantees about annotations for a record being grouped together. To be sure you’ve picked up all features for a record, you need to parse the entire GFF file. For large real-life files this becomes a problem as all of the features being added will rapidly fill up available memory.

To solve these problems, GFF parsing is implemented here as a feature addition module with filtering. This means that you first use standard Biopython SeqIO to parse the base sequence records, and then use the GFF class to add features to these initial records. The addition function has an option argument allowing added features to be limited to a subset of interest. You can limit based on record names and add all features related to a specific sequence, or you can limit based on types of GFF features and add these features to all records in the file.

This example demonstrates the use of the GFF parser to parse out all of the coding sequence features for chromosome one ('I'), and add them to the initial record:

from Bio import SeqIO
from BCBio.SeqIO.GFFIO import GFFFeatureAdder

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
feature_adder = GFFFeatureAdder(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
with open(gff_file) as gff_handle:
    feature_adder.add_features(gff_handle, cds_limit_info)
final_rec = feature_adder.base['I']

This example shows the other unique aspect of GFF parsing: nested features. In the example above we pull out coding genes, mRNA transcripts, and coding sequences (CDS). These are nested as a gene can have multiple mRNAs, and CDSs are mapped to one or more mRNA transcripts. In Biopython this is handled naturally using the sub_feature attribute of SeqFeature. So when handling the record, you will dig into a gene feature to find its transcripts and coding sequences. For a more detailed description of how GFF can be mapped to complex transcripts, see the GFF3 documentation, which has diagrams and examples of different biological cases and how they are represented.


The test code features several other usage examples which should help provide familiarity with the interface. For real life testing, this was run against the latest C elegans WormBase release, WS199: GFF; sequences. On a standard single processor workstation, the code took about 2 and a half minutes to parse all PCR products and coding sequences from the 1.3G GFF file and 100M genome fasta file.


To go full circle back to my initial inspiration, the parsed GFF was pushed into a BioSQL database using this script. To test on your own machine, you will have to adjust the database details at the start of the script to match your local configuration instead of my test database.

Standard flattened features are well supported going into BioSQL. Nested features, like the coding sequence representation mentioned above, will need additional work. The current loader only utilizes sub_features to get location information and support the join(1..3,5..8) syntax of GenBank. The seqfeature_relationship table in BioSQL seems like the right place to start to support this.


This provides an initial implementation of GFF3 parsing support for Biopython. The interface is a proposal and I welcome suggestions on making it more intuitive. Code and test example contributions are also much appreciated. As we find an interface and implementation that works for the python community and the code stabilizes, we can work to integrate this into the Biopython project.

Written by Brad Chapman

March 8, 2009 at 11:01 am

Posted in OpenBio

Tagged with , , , ,

Organization of literature using PubMed related articles

leave a comment »

When dealing with a long list of journal articles, what is the best method to organize them? I was confronted with this problem in designing an interface where users would pick specific papers and retrieve results tied to them. Presenting them as the raw list was unsatisfying; it is fine for users who know exactly what articles they want, but naive users would have a lot of difficulty finding relevant articles. Even for power users, a better classification system could help reveal items they may not have known about.

The approach I took was to group together papers based on similarity. The NCBI PubMed literature database has links to related articles, which it exposes programmatically though EUtils. Using the Biopython Entrez interface, the first step is to retrieve a dictionary of related IDs for each starting article, ordered by relevance:

def _get_elink_related_ids(self, pmids):
    pmid_related = {}
    for pmid in pmids:
        handle = Entrez.elink(dbform='pubmed', db='pubmed', id=pmid)
        record =
        cur_ids = []
        for link_dict in record[0]['LinkSetDb'][0]['Link']:
            cur_ids.append((int(link_dict.get('Score', 0)),
        local_ids = [x[1] for x in cur_ids if x[1] in pmids]
        if pmid in local_ids:
        pmid_related[pmid] = local_ids
    return pmid_related

Trying to group directly based on this dictionary will often result in one large group, since many of the articles may be linked together through a few common articles. For instance, a review may be related to several other papers in non-overlapping areas. To make the results as useful as possible we define a maximum and minimum group size, and a two parameters to filter the related lists:

  • overrep_thresh: The percentage of papers an item is related to out of all papers being grouped; the threshold sets a maximum number of papers that can be related. For instance, a value of .25 means that a journal will be related to 25% or less of the total papers.
  • related_max: The number of related papers to use. The best related articles go into the grouping.

These parameters define a filter for our dictionary of related articles:

def _filter_related(self, inital_dict, overrep_thresh, related_max):
    final_dict = {}
    all_vals = reduce(operator.add, inital_dict.values())
    for item_id, item_vals in inital_dict.items():
        final_vals = [val for val in item_vals if 
            float(all_vals.count(val)) / len(inital_dict) <= overrep_thresh&#93;
        final_dict&#91;item_id&#93; = final_vals&#91;:related_max&#93;
    return final_dict

The filtered list is grouped using a generalized version of the <code>examine_paralogs</code> function used in an earlier post to group together <a href="">location and duplication information</a>. Sets combine any groups with overlapping articles:

def _groups_from_related_dict(self, related_dict):
    cur_groups = []
    all_base = related_dict.keys()
    for base_id, cur_ids in related_dict.items():
        overlap = set(cur_ids) & set(all_base)
        if len(overlap) > 0:
            new_group = set(overlap | set([base_id]))
            is_unique = True
            for exist_i, exist_group in enumerate(cur_groups):
                if len(new_group & exist_group) > 0:
                    update_group = new_group | exist_group
                    cur_groups[exist_i] = update_group
                    is_unique = False
            if is_unique:
    return [list(g) for g in cur_groups]

With this list, we want to extract the groups and their articles that fit in our grouping criteria, the minimum and maximum size:

def _collect_new_groups(self, pmid_related, groups):
final_groups = []
for group_items in groups:
final_items = [i for i in group_items if pmid_related.has_key(i)]
if (len(final_items) >= self._min_group and
len(final_items) <= self._max_group): final_groups.append(final_items) for item in final_items: del pmid_related[item] final_related_dict = {} for pmid, related in pmid_related.items(): final_related = [r for r in related if pmid_related.has_key(r)] final_related_dict[pmid] = final_related return final_groups, final_related_dict [/sourcecode]

Utilizing these functions, the main algorithm steps through a series of increasingly less stringent parameters picking out groups which fall into our thresholds. Closely related journal articles are grouped first; more general papers with less association will be placed in groups in later rounds:

def get_pmid_groups(self, pmids):
pmid_related = self._get_elink_related_ids(pmids)
filter_params = self._filter_params[:]
final_groups = []
while len(pmid_related) > 0:
if len(filter_params) == 0:
raise ValueError(“Ran out of parameters before finding groups”)
cur_thresh, cur_related = filter_params.pop(0)
while 1:
filt_related = self._filter_related(pmid_related, cur_thresh,
groups = self._groups_from_related_dict(filt_related)
new_groups, pmid_related = self._collect_new_groups(
pmid_related, groups)
if len(new_groups) == 0:
if len(pmid_related) < self._max_group: final_groups.append(pmid_related.keys()) pmid_related = {} return final_groups [/sourcecode]

The full code wrapped up into a class is available from the GitHub repository.

This is one approach to automatically grouping a large list of literature to make interesting items more discoverable. With the work being done on full text indexing, the data underlying resources such as iHOP might be used to do these groupings even more effectively using similar algorithms.

Written by Brad Chapman

February 13, 2009 at 6:19 am

Automated protein conservation display from BLAST alignments

with 16 comments

Pawel at Freelancing Science had an excellent post about making structural predictions from visual analysis of BLAST alignments. This concept inspired me to put together an automated solution to visualize protein conservation. Starting with a protein of interest it will retrieve the relevant identifier from NCBI, do a remote BLAST, examine the alignments to related divergent proteins, calculate conservation and display a plot of the results.

With a protein identifier like a GenBank accession number or UniProt ID, we first need to find the standard NCBI GI number. Using Biopython’s Entrez module:

def search_for_gi(self, uniprot_id, db_name):
    handle = Entrez.esearch(db=db_name, term=uniprot_id)
    record =
    ids = record["IdList"]
    if len(ids) == 0:
        raise ValueError("Not found in NCBI: %s" % ids)
    return ids[0]

Given the GI, a remote BLAST is performed and the XML result is parsed into a record object. This is again done using Biopython libraries, with the BLAST result cached in case of re-runs. If you were using this code for many queries or on a web page presented to scientists, it would make sense to use a local BLAST for speed purposes. This could easily be plugged in, again using Biopython. Here is the remote version:

def remote_blast(self, search_gi, blast_method):
    out_file = os.path.join(self._cache_dir, "%s_%s_blo.xml" % (blast_method,
    if not os.path.exists(out_file):
        blast_handle = NCBIWWW.qblast(blast_method, "nr", search_gi)
        with open(out_file, 'w') as out_handle:
            for line in blast_handle:
    with open(out_file) as in_handle:
        rec_it = NCBIXML.parse(in_handle)

With the parsed record, the next step is to loop over the alignments to calculate conservation across the protein. To provide quantitative results, a protein substitution matrix provides a score for each BLAST alignment character pair. Higher scores indicate a more conserved alignment, with exact matches being the highest scores. We use the BLOSUM62 matrix here, although a wide variety are supported by Biopython. The class below loops through all of the alignments and high scoring pairs (HSP, in BLAST nomenclature), notes the position, and uses the alignment pairs and matrix to assign conservation scores at each position:

class BlastConservationCalculator:
    def __init__(self, matrix_name="blosum62"):
        self._subs_mat = getattr(MatrixInfo, matrix_name)
        self._no_use_thresh = 0.95

    def conservation_dict(self, blast_rec):
        cons_dict = {}
        rec_size = int(blast_rec.query_letters)
        for base_index in range(rec_size):
            cons_dict[base_index] = []
        for align in blast_rec.alignments:
            for hsp in align.hsps:
                if (float(hsp.identities) / float(rec_size) <=
                    cons_dict = self._add_hsp_conservation(hsp, cons_dict)
        return cons_dict

    def _add_hsp_conservation(self, hsp, cons_dict):
        start_index = int(hsp.query_start) - 1
        hsp_index = 0
        for q_index in range(len(hsp.query)):
            if (hsp.query&#91;q_index&#93; != '-'):
                if (hsp.sbjct&#91;q_index&#93; != '-'):
                        sub_val = self._subs_mat&#91;(hsp.query&#91;q_index&#93;,
                    except KeyError:
                        sub_val = self._subs_mat&#91;(hsp.sbjct&#91;q_index&#93;,
                    cons_dict&#91;start_index + hsp_index&#93;.append(sub_val)
                hsp_index += 1
        return cons_dict

The result of this work is a dictionary of score conservation at each position. If you plot the average of these scores directly, it results in a very choppy graph which is hard to interpret. Luckily, Andrew Dalke has tackled this problem and presented a detailed writeup of <a href="">smoothing scores for plotting</a>. Using the Savitzky-Golay technique described there, the smoothed average of the results are plotted using <a href="">matplotlib</a>:

window_size = 29
data_smoother = SavitzkyGolayDataSmoother(window_size)
pos_data = []
cons_data = []
for pos in indexes:
    pos_data.append(pos + 1)
    if len(cons_dict[pos]) > 0:
smooth_data = data_smoother.smooth_values(cons_data)
smooth_pos_data = pos_data[data_smoother.half_window():
        len(pos_data) - data_smoother.half_window()]
pylab.plot(smooth_pos_data, smooth_data)
pylab.axis(xmin=min(pos_data), xmax=max(pos_data))
pylab.xlabel("Amino acid position")
pylab.savefig('%s_conservation.png' % accession.replace(".", "_"))

The resulting plot was prepared for the example from Pawel’s post that inspired all this and is shown below. We can see the 4 regions of less conservation noted by Pawel by inspection of the alignment, along with the 3 intervening peaks of conservation:

Example conservation plot

The full script puts all of these parts together into a working version that could be used for your own proteins of interest. These plots are useful for getting a quick overview of protein conservation when working with a new protein. They could be used to compare with known regions like domains, to identify likely regions of importance for protein deletion studies, or to make structure evaluations. The intriguing aspect of the plots is the opportunity to quickly evaluate and make predictions for experimental study.

Written by Brad Chapman

February 7, 2009 at 5:18 pm