Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Graphing of variables before classification

leave a comment »

It is important to be able to graphically evaluate variables that feed into classification algorithms. In addition to assessing the distribution of the data, visual inspection also reveals patterns that can be later tested statistically. This post describes the preparation of a quick histogram for data from positive and negative examples feeding into a classifier. The excellent matplotlib library is used for visualization.

The example involves classifying two sets of proteins based on regional sequence charge. Two FASTA files were prepared, containing positive (active) and negative (non-active) examples. The goal is to look for a difference in charge between the two groups. Given a window size of amino acids to look at, we loop over the records in the file using the Biopython SeqIO interface:

def file_charges(in_file, cur_window):
    all_charges = []
    with open(in_file) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
            cur_charges = calc_region_charges(rec.seq, cur_window)
            all_charges.extend(cur_charges)
    return all_charges

We also use Biopython to calculate the Isoelectric point of each protein window. This is used as a proxy for charge; higher isoelectric points correspond to higher charges over the region.

def calc_region_charges(seq, cur_window):
    # internal small regions, so do not deal with C and N terminal charges
    IsoelectricPoint.pKcterminal = {}
    IsoelectricPoint.pKnterminal = {}
    cur_pos = 0
    region_charges = []
    while cur_pos < len(seq) - cur_window:
        cur_seq = seq&#91;cur_pos:cur_pos + cur_window&#93;
        prot_analysis = ProtParam.ProteinAnalysis(str(cur_seq))
        ie_calc = IsoelectricPoint.IsoelectricPoint(cur_seq,
                prot_analysis.count_amino_acids())
        region_charges.append(ie_calc.pi())
        cur_pos += 1
    return region_charges
&#91;/sourcecode&#93;

<p>
With this in place, we get the charges for the example records and plot them side by side as a histogram using matplotlib:
</p>


cur_window = 75
pos_charges = file_charges(pos_file, cur_window)
neg_charges = file_charges(neg_file, cur_window)
n, bins, patches = pylab.hist([pos_charges, neg_charges], 30,
        normed=True, histtype='bar')
pylab.xlabel('Isoelectric point')
pylab.ylabel('Normalized percent of regions')
pylab.title('Protein charge of %s amino acid windows' % cur_window)
pylab.legend([patches[0][0], patches[1][0]], ['positives', 'negatives'])
pylab.savefig('pos_neg_hist.png')
pylab.show()

The resulting graph shows the different distribution of charge between the positive and negative records. At an isoelectric point just above 10, only sequence windows from the positive examples are found. Looking at the percentage of highly charged sequence regions above this threshold in non-classified sequences could serve as the basis for a classifier.

Example histogram

Example histogram

The example script puts all of this together, and could be used as a template for classification problems in your own research.

Written by Brad Chapman

January 24, 2009 at 1:26 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: