Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Finding proteins with related function using semantic clustering


An immediate goal when investigating a protein of interest is to identify related proteins. There are several ways you can go about this:

  • Search for proteins by sequence similarity, using BLAST or other tools.
  • Search for proteins with similar characteristics, using features like InterPro domains.
  • Search for proteins determined to have similar functionality, using the literature.

Here we will automate the literature-functionality approach using Zemanta. Previously, we talked about how well Zemanta extracted biological information from gene descriptions. This can be used to classify UniProt descriptions and find proteins with similar functionality to a target of interest.

In this example, we will look for proteins similar to a Polycomb chromatin remodeling protein in mouse. As our base, we build a query of all mouse proteins in UniProt classified as repressors: organism:"Mus musculus [10090]" AND keyword:"Repressor [678]". This provides 350 records, which can be downloaded through the UniProt web interface as a tab-delimited file.
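As a sketch of working with that download, the snippet below parses a tab-delimited UniProt export into a list of accessions. The column layout (an "Entry" accession column, as in a default export) and the two sample rows are assumptions for illustration; adjust to match the columns you select in the UniProt interface:

```python
import csv
import io

def parse_uniprot_tab(handle):
    """Parse a tab-delimited UniProt download into a list of accessions.

    Assumes an 'Entry' column holds the accession, as in a default export.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    entry_col = header.index("Entry")
    return [row[entry_col] for row in reader if row]

# Hypothetical two-record excerpt standing in for the real 350-record file
example = io.StringIO(
    "Entry\tEntry name\tProtein names\n"
    "Q00001\tEXA1_MOUSE\tExample repressor 1\n"
    "Q00002\tEXA2_MOUSE\tExample repressor 2\n"
)
ids = parse_uniprot_tab(example)
```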

Next, Zemanta is used to extract keywords linked to either Wikipedia or Freebase. Starting with our UniProt XML retriever, we get the functional descriptions:

def get_description_terms(retriever, cur_id, api_key):
    """Get the UniProt functional description and extract linked keywords."""
    metadata = retriever.get_xml_metadata(cur_id)
    if "function_descr" in metadata:
        keywords = zemanta_link_kws(metadata["function_descr"], api_key)
        if len(keywords) > 0:
            return keywords
    return []

The descriptions are then fed to Zemanta, which extracts the keywords:

import urllib
import urllib2
import simplejson

def zemanta_link_kws(search_text, api_key):
    """Query the Zemanta suggest API, returning linked keyword titles."""
    gateway = 'http://api.zemanta.com/services/rest/0.0/'
    args = {'method': 'zemanta.suggest',
            'api_key': api_key,
            'text': search_text,
            'return_categories': 'dmoz',
            'return_images': 0,
            'return_rdf_links': 1,
            'format': 'json'}
    args_enc = urllib.urlencode(args)
    raw_output = urllib2.urlopen(gateway, args_enc).read()
    output = simplejson.loads(raw_output)
    # Collect titles of links resolved to Wikipedia or RDF (Freebase) targets
    link_kws = []
    for link in output['markup']['links']:
        for target in link['target']:
            if target['type'] in ['wikipedia', 'rdf']:
                link_kws.append(target['title'])
    return list(set(link_kws))
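To make the response handling concrete, here is the same extraction loop run over a mocked-up payload. The payload shape (a markup.links list whose entries carry target lists with type and title fields) mirrors what the parsing code above expects; the keyword values themselves are invented for illustration:

```python
# Mocked-up Zemanta-style response; titles are invented for illustration
output = {"markup": {"links": [
    {"target": [
        {"type": "wikipedia", "title": "Chromatin"},
        {"type": "rdf", "title": "Chromatin"},
    ]},
    {"target": [
        {"type": "wikipedia", "title": "Histone"},
        {"type": "amazon", "title": "Some product"},  # skipped: wrong type
    ]},
]}}

link_kws = []
for link in output["markup"]["links"]:
    for target in link["target"]:
        if target["type"] in ["wikipedia", "rdf"]:
            link_kws.append(target["title"])
# Deduplicate; sorted here only to make the result order predictable
link_kws = sorted(set(link_kws))
```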

With a list of UniProt IDs and keywords organized into a dictionary, we then build a binary matrix of terms for clustering. This approach is described in the excellent book Programming Collective Intelligence. The resulting matrix has UniProt IDs as rows and keyword terms as columns. Each value is 1 if the term is contained in the Zemanta extracted keywords for that UniProt ID, or 0 otherwise:

import operator
import numpy

def organize_term_array(cur_db):
    """Build a binary ID-by-term matrix from a {uniprot_id: [terms]} dict."""
    all_terms = list(set(reduce(operator.add, cur_db.values())))
    term_matrix = []
    all_ids = []
    for uniprot_id, cur_terms in cur_db.items():
        # 1 if this record's keywords contain the term, 0 otherwise
        cur_row = [(1 if t in cur_terms else 0) for t in all_terms]
        term_matrix.append(cur_row)
        all_ids.append(uniprot_id)
    return numpy.array(term_matrix), all_ids
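To see the shape this produces, here is a toy three-protein dictionary run through the same logic, using plain lists for brevity and sorting the terms so the column order is predictable. The IDs and keywords are made up:

```python
# Invented three-record dictionary of {uniprot_id: [keywords]}
toy_db = {
    "ID1": ["Chromatin", "Histone"],
    "ID2": ["Chromatin"],
    "ID3": ["Histone", "Polycomb-group proteins"],
}
# Sorted so columns are ['Chromatin', 'Histone', 'Polycomb-group proteins']
all_terms = sorted({t for terms in toy_db.values() for t in terms})
all_ids = sorted(toy_db)
# One row per ID; 1 where the record's keywords contain the column's term
term_matrix = [[1 if t in toy_db[i] else 0 for t in all_terms]
               for i in all_ids]
```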

Finally, this matrix is fed into the Cluster module in Biopython for k-means clustering. The clusters are examined, and the other IDs grouped with our target protein are printed:

import collections
from Bio import Cluster

# k-means with 10 clusters and 20 passes, using arithmetic mean
# centroids ('a') and Euclidean distance ('e')
cluster_ids, error, nfound = Cluster.kcluster(term_matrix,
        nclusters=10, npass=20, method='a', dist='e')
cluster_dict = collections.defaultdict(list)
for i, cluster_id in enumerate(cluster_ids):
    cluster_dict[cluster_id].append(uniprot_ids[i])
# Print the keywords for every protein sharing a cluster with our target
for cluster_group in cluster_dict.values():
    if target_id in cluster_group:
        for item in cluster_group:
            print item, cur_db[item]
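The grouping step stands on its own apart from the clustering itself. Below, a toy cluster assignment (standing in for a real Cluster.kcluster result; the IDs and assignments are invented) shows how the target's cluster members are pulled out:

```python
import collections

# Invented assignment for five proteins: cluster_ids[i] is the cluster
# of uniprot_ids[i], as kcluster would return it
uniprot_ids = ["ID1", "ID2", "ID3", "ID4", "ID5"]
cluster_ids = [0, 1, 0, 1, 1]
target_id = "ID3"

cluster_dict = collections.defaultdict(list)
for i, cluster_id in enumerate(cluster_ids):
    cluster_dict[cluster_id].append(uniprot_ids[i])

# The group containing the target holds its putative functional relatives
target_group = [g for g in cluster_dict.values() if target_id in g][0]
```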

The full script shows all of these parts tied together. For our example, Zemanta pulls out the following keyword list: ['Polycomb-group proteins', 'Homeotic gene', 'Histone', 'Lys (department)', 'Chromatin', 'Histone H2A'] and clustering identifies 19 similar proteins. This could be presented through a web interface into which a scientist enters a protein of interest and gets back the resulting list for manual inspection.

Conceptually, this automated approach is similar to an expert searching through the literature. Here, we are virtually clicking through Wikipedia links and noting similarities. By leveraging general-purpose tools like Zemanta, we avoid having to build a science-specific tool for this purpose. This is an additional argument for adopting general tools like Wikipedia for scientific annotation.

Written by Brad Chapman

January 19, 2009 at 12:34 pm

Posted in semanticweb

