Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Graphing of variables before classification

with one comment

It is important to be able to graphically evaluate variables that feed into classification algorithms. In addition to assessing the distribution of the data, visual inspection also reveals patterns that can be later tested statistically. This post describes the preparation of a quick histogram for data from positive and negative examples feeding into a classifier. The excellent matplotlib library is used for visualization.

The example involves classifying two sets of proteins based on regional sequence charge. Two FASTA files were prepared, containing positive (active) and negative (non-active) examples. The goal is to look for a difference in charge between the two groups. Given a window size of amino acids to look at, we loop over the records in the file using the Biopython SeqIO interface:

def file_charges(in_file, cur_window):
    all_charges = []
    with open(in_file) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
            cur_charges = calc_region_charges(rec.seq, cur_window)
    return all_charges

We also use Biopython to calculate the Isoelectric point of each protein window. This is used as a proxy for charge; higher isoelectric points correspond to higher charges over the region.

def calc_region_charges(seq, cur_window):
    # internal small regions, so do not deal with C and N terminal charges
    IsoelectricPoint.pKcterminal = {}
    IsoelectricPoint.pKnterminal = {}
    cur_pos = 0
    region_charges = []
    while cur_pos < len(seq) - cur_window:
        cur_seq = seq&#91;cur_pos:cur_pos + cur_window&#93;
        prot_analysis = ProtParam.ProteinAnalysis(str(cur_seq))
        ie_calc = IsoelectricPoint.IsoelectricPoint(cur_seq,
        cur_pos += 1
    return region_charges

With this in place, we get the charges for the example records and plot them side by side as a histogram using matplotlib:

cur_window = 75
pos_charges = file_charges(pos_file, cur_window)
neg_charges = file_charges(neg_file, cur_window)
n, bins, patches = pylab.hist([pos_charges, neg_charges], 30,
        normed=True, histtype='bar')
pylab.xlabel('Isoelectric point')
pylab.ylabel('Normalized percent of regions')
pylab.title('Protein charge of %s amino acid windows' % cur_window)
pylab.legend([patches[0][0], patches[1][0]], ['positives', 'negatives'])

The resulting graph shows the different distribution of charge between the positive and negative records. At an isoelectric point just above 10, only sequence windows from the positive examples are found. Looking at the percentage of highly charged sequence regions above this threshold in non-classified sequences could serve as the basis for a classifier.

Example histogram

Example histogram

The example script puts all of this together, and could be used as a template for classification problems in your own research.

Written by Brad Chapman

January 24, 2009 at 1:26 pm

One Response

Subscribe to comments with RSS.

  1. This site is just gret. I’ve search these informations a whole lot and I
    view it that is professional, fast to comprehend. I congratulate you for this research that I’ll recommend
    to prospects friends. I ask you to go to the site where each pupil or university student can find ratings gpa levels.

    Halley Gallishaw

    March 31, 2018 at 12:42 pm

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: