Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Automated protein conservation display from BLAST alignments

with 16 comments

Pawel at Freelancing Science had an excellent post about making structural predictions from visual analysis of BLAST alignments. This concept inspired me to put together an automated solution to visualize protein conservation. Starting with a protein of interest it will retrieve the relevant identifier from NCBI, do a remote BLAST, examine the alignments to related divergent proteins, calculate conservation and display a plot of the results.

With a protein identifier like a GenBank accession number or UniProt ID, we first need to find the standard NCBI GI number. Using Biopython’s Entrez module:

def search_for_gi(self, uniprot_id, db_name):
    handle = Entrez.esearch(db=db_name, term=uniprot_id)
    record = Entrez.read(handle)
    ids = record["IdList"]
    if len(ids) == 0:
        raise ValueError("Not found in NCBI: %s" % ids)
    return ids[0]

Given the GI, a remote BLAST is performed and the XML result is parsed into a record object. This is again done using Biopython libraries, with the BLAST result cached in case of re-runs. If you were using this code for many queries or on a web page presented to scientists, it would make sense to use a local BLAST for speed purposes. This could easily be plugged in, again using Biopython. Here is the remote version:

def remote_blast(self, search_gi, blast_method):
    out_file = os.path.join(self._cache_dir, "%s_%s_blo.xml" % (blast_method,
        search_gi))
    if not os.path.exists(out_file):
        blast_handle = NCBIWWW.qblast(blast_method, "nr", search_gi)
        with open(out_file, 'w') as out_handle:
            for line in blast_handle:
                out_handle.write(line)
        blast_handle.close()
    with open(out_file) as in_handle:
        rec_it = NCBIXML.parse(in_handle)
        return rec_it.next()

With the parsed record, the next step is to loop over the alignments to calculate conservation across the protein. To provide quantitative results, a protein substitution matrix provides a score for each BLAST alignment character pair. Higher scores indicate a more conserved alignment, with exact matches being the highest scores. We use the BLOSUM62 matrix here, although a wide variety are supported by Biopython. The class below loops through all of the alignments and high scoring pairs (HSP, in BLAST nomenclature), notes the position, and uses the alignment pairs and matrix to assign conservation scores at each position:

class BlastConservationCalculator:
    def __init__(self, matrix_name="blosum62"):
        self._subs_mat = getattr(MatrixInfo, matrix_name)
        self._no_use_thresh = 0.95

    def conservation_dict(self, blast_rec):
        cons_dict = {}
        rec_size = int(blast_rec.query_letters)
        for base_index in range(rec_size):
            cons_dict[base_index] = []
        for align in blast_rec.alignments:
            for hsp in align.hsps:
                if (float(hsp.identities) / float(rec_size) <=
                        self._no_use_thresh):
                    cons_dict = self._add_hsp_conservation(hsp, cons_dict)
        return cons_dict

    def _add_hsp_conservation(self, hsp, cons_dict):
        start_index = int(hsp.query_start) - 1
        hsp_index = 0
        for q_index in range(len(hsp.query)):
            if (hsp.query&#91;q_index&#93; != '-'):
                if (hsp.sbjct&#91;q_index&#93; != '-'):
                    try:
                        sub_val = self._subs_mat&#91;(hsp.query&#91;q_index&#93;,
                                                  hsp.sbjct&#91;q_index&#93;)&#93;
                    except KeyError:
                        sub_val = self._subs_mat&#91;(hsp.sbjct&#91;q_index&#93;,
                                                  hsp.query&#91;q_index&#93;)&#93;
                    cons_dict&#91;start_index + hsp_index&#93;.append(sub_val)
                hsp_index += 1
        return cons_dict
&#91;/sourcecode&#93;

<p>
The result of this work is a dictionary of score conservation at each position. If you plot the average of these scores directly, it results in a very choppy graph which is hard to interpret. Luckily, Andrew Dalke has tackled this problem and presented a detailed writeup of <a href="http://www.dalkescientific.com/writings/NBN/plotting.html">smoothing scores for plotting</a>. Using the Savitzky-Golay technique described there, the smoothed average of the results are plotted using <a href="http://matplotlib.sourceforge.net/">matplotlib</a>:
</p>


window_size = 29
data_smoother = SavitzkyGolayDataSmoother(window_size)
pos_data = []
cons_data = []
for pos in indexes:
    pos_data.append(pos + 1)
    if len(cons_dict[pos]) > 0:
        cons_data.append(numpy.median(cons_dict[pos]))
    else:
        cons_dict.append(0)
smooth_data = data_smoother.smooth_values(cons_data)
smooth_pos_data = pos_data[data_smoother.half_window():
        len(pos_data) - data_smoother.half_window()]
pylab.plot(smooth_pos_data, smooth_data)
pylab.axis(xmin=min(pos_data), xmax=max(pos_data))
pylab.xlabel("Amino acid position")
pylab.ylabel("Conservation")
pylab.savefig('%s_conservation.png' % accession.replace(".", "_"))

The resulting plot was prepared for the example from Pawel’s post that inspired all this and is shown below. We can see the 4 regions of less conservation noted by Pawel by inspection of the alignment, along with the 3 intervening peaks of conservation:

Example conservation plot

The full script puts all of these parts together into a working version that could be used for your own proteins of interest. These plots are useful for getting a quick overview of protein conservation when working with a new protein. They could be used to compare with known regions like domains, to identify likely regions of importance for protein deletion studies, or to make structure evaluations. The intriguing aspect of the plots is the opportunity to quickly evaluate and make predictions for experimental study.

Written by Brad Chapman

February 7, 2009 at 5:18 pm

16 Responses

Subscribe to comments with RSS.

  1. Hello,

    I got this version of your script from github:
    http://preview.tinyurl.com/cemlp4

    While executing I am getting following error:

    File “./chapman_01.py”, line 42, in main
    cons_dict.append(0)
    AttributeError: ‘dict’ object has no attribute ‘append’

    I am running Python 2.5.2 on Ubuntu 8.04, biopython 1.49.

    Hope it helps,

    Darek Kedra

    darked

    March 2, 2009 at 8:37 am

    • Darek;
      Thanks for the problem report. My mistake; that was a typo and the line should be replaced with:

      cons_data.append(0)

      I fixed this on github, so you can also download the updated version from there. Hope this works for you,
      Brad

      Brad Chapman

      March 2, 2009 at 8:47 am

      • Brad,

        wow, that was FAST! Works now as expected. Many thanks.

        It is a great quick-and-dirty way of looking at a protein. I am not sure if it makes sense to sacrifice speed and simplicity for more accuracy, i.e. by using only non-identical blastp hits or using more than 50 hit sequences.
        In the end this will likely come near to a conservation plot of multiple sequence alignment / programs determining functional AA residues.

        * for selecting divergent blast hits I started using: http://bmf2.colorado.edu/divergentset

        * for functional residues INTREPID
        http://phylogenomics.berkeley.edu/activesite/details.html

        but these can be very slow.

        Darek Kedra

        darked

        March 2, 2009 at 4:53 pm

  2. Your program is sure to be very convenient. But I’ve got a tiny query: is it possible to use this program to visualize sequence conservation when BLAST is performed using a query sequence against some proteins in a custom database of mine (where sequences are stored in FASTA format) instead of the non-redundant proteins of NCBI’s database?

    bestdecoy-108@yahoo.co.in

    March 17, 2012 at 7:39 am

  3. I am trying to run the code but I have tough time knowing how would I will be able to have the input data, should it be included in the command line or what exactly? Please explain that using a command line. Thanks a lot
    Nawaf.

    Steven

    April 27, 2012 at 2:55 pm

    • The script header describes the input. It’s an NCBI accession number or protein ID:

      https://github.com/chapmanb/bcbb/blob/master/visualize/blast_conservation_plot.py#L4

      As discussed in the comments above, this takes advantage of NCBI’s remote BLAST but could be modified to support local queries or your own FASTA inputs. Hope this helps.

      Brad Chapman

      April 27, 2012 at 3:33 pm

      • How can I modified that to support my own local queries, any idea?
        Thanks

        Steven

        April 28, 2012 at 2:12 pm

      • Also, if I want to just use the NCBI’s remote blast how can I get the output that should be for the program. Because when I run the program using linux, python program.py , it does give me only the message that is at the start of the program but doesn’t give me anything as an output? How that can be obtain?
        Thanks.

        Steven

        April 28, 2012 at 2:19 pm

        • Steven;
          To run it you need to pass an NCBI accession to the script. So to reproduce the example plot:

          $ python blast_conservation_plot.py BAA36600.1
          $ display BAA36600_1_conservation.png
          

          Supporting local queries would require writing custom code to replace the current ‘remote_blast’ function. If you look in the comments above I’ve provided a few links to the relevant parts of the script.

          Hope this helps.

          Brad Chapman

          April 30, 2012 at 6:51 am

  4. Hey, I am trying to run your module through python 3.3 on windows and I keep getting a syntax error near the end. It doesn’t like the second quotation mark around “Incorrect arguments”. I’m sure that isn’t the real problem, but I’m not sure what is. Could you help me?

    Thanks!

    Josh

    September 14, 2013 at 2:27 pm

    • Josh;
      This script isn’t Python 3 compatible, but should work okay on recent versions of Python 2.x like 2.7. There are some significant syntax changes in Python 3 that require porting scripts over. Hope this works cleanly on 2.7 for you.

      Brad Chapman

      September 15, 2013 at 8:33 pm

  5. Hi,
    I would like to make a conservation plot of my protein (using selected species). The graph shown here is actually what I want, but it is way beyond my understanding how it is done (=I do not know anything about programming). Does perhaps a user friendly program exist where I could put a ClustalW alignment of my protein sequences and get out a nice protein conservation plot as shown on this web page?
    Thanks!

    Tinka

    November 10, 2013 at 1:50 am

  6. Tinka;
    I’m glad to hear this approach is useful. Unfortunately I don’t have anything more automated or user-friendly than the script posted here. Just from searching around a bit, perhaps the Plot Protein tool would be helpful (base site: https://sites.google.com/site/plotprotein/ and web tool: http://glimmer.rstudio.com/tturner/plotProtein/)

    Brad Chapman

    November 10, 2013 at 1:41 pm

    • Dear Brad,
      thank you for you fast response and suggestion. I spend quite some time on the net search for a programs to make a protein conservation plot and along the search I also found the site you mentioned. But none of the programs (I found 3 or 4) made as simple/nice looking and straightforward plots as shown here.

      I guess not that nice looking graph is till better than no graph :)

      Tinka

      November 11, 2013 at 10:30 am

  7. Hi Brad,
    It looks like the utility of this script and graph are standing the test of time! After trying to run this script, I have a few questions for you.

    After running the script at the command line with the accession code used in the example, I got the following error code:

    Email address is not specified.

    To make use of NCBI’s E-utilities, NCBI requires you to specify your
    email address with each request. As an example, if your email address
    is A.N.Other@example.com, you can specify it as follows:
    from Bio import Entrez
    Entrez.email = ‘A.N.Other@example.com’
    In case of excessive usage of the E-utilities, NCBI will attempt to contact
    a user at the email address provided before blocking access to the
    E-utilities.
    E-utilities.”””, UserWarning)

    And I couldn’t find an area in the script to specify an email? Do you have an idea about how I might fix this?

    Additionally, my goal is to make a conservation plot with only bacteria and fungi in my blast, which will require additional specifications. Do you know if I could add these specifications into this script without writing custom code to do a local blast?

    Thanks for your help!
    Elliot

    Elliot Hershberg

    July 14, 2017 at 5:23 pm


Leave a reply to Brad Chapman Cancel reply