Extracting protein characteristics for an InterPro domain

The InterPro database provides a common language and organization for characterized protein domains. Here, the goal is to extract all proteins containing a domain of interest, limit these proteins to those in the animal (Metazoa) kingdom, and extract information about the resulting proteins. Protein information will be retrieved from the UniProt database.

The first step is identifying the proteins of interest with a given domain. An example is the chromo domain motif with InterPro identifier IPR000953. InterPro provides a useful REST interface, allowing us to download the full list of related proteins in FASTA format and parse them using the Biopython SeqIO interface. For our example, this provides over 2500 records; here is a snippet of the class function that does this work.

def get_interpro_records(self, ipr_number):
    url_base = "%s/interpro/ISpy?ipr=%s&mode=fasta"
    full_url = url_base % (self._server, ipr_number)
    recs = []
    with self._get_open_handle(full_url) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
            recs.append(rec)
    return recs

UniProt provides an excellent REST interface and XML format which help simplify the retrieval process. Parsing the XML records with the python ElementTree parser allows us to quickly pull out the organism name and evolutionary lineage.

import xml.etree.ElementTree as ET

def get_xml_metadata(self, uniprot_id):
    url_base = "%s/uniprot/%s.xml"
    full_url = url_base % (self._server, uniprot_id)
    metadata = {}
    with self._get_open_handle(full_url) as in_handle:
        root = ET.parse(in_handle).getroot()
        org = root.find("%sentry/%sorganism" % (self._xml_ns, self._xml_ns))
        for org_node in org:
            if org_node.tag == "%sname" % self._xml_ns:
                if org_node.attrib["type"] == "scientific":
                    metadata["org_scientific_name"] = org_node.text
                elif org_node.attrib["type"] == "common":
                    metadata["org_common_name"] = org_node.text
            elif org_node.tag == "%slineage" % self._xml_ns:
                metadata["org_lineage"] = [n.text for n in org_node]
    return metadata

Putting all the parts together, we loop over the list of Biopython sequence record objects, extract the UniProt ID, retrieve the metadata from UniProt, and store this in a local python shelve database:

cache_dir = os.path.join(os.getcwd(), "cache")
db_dir = os.path.join(os.getcwd(), "db")
interpro_retriever = InterproRestRetrieval(cache_dir)
uniprot_retriever = UniprotRestRetrieval(cache_dir)
cur_db = shelve.open(os.path.join(db_dir, ipr_number))
seq_recs = interpro_retriever.get_interpro_records(ipr_number)
for seq_rec in seq_recs:
    uniprot_id = seq_rec.id.split("|")[0]
    metadata = uniprot_retriever.get_xml_metadata(uniprot_id)
    if (metadata.has_key("org_lineage") and 
            "Metazoa" in metadata["org_lineage"]):
        metadata["seq"] = seq_rec.seq.data
        cur_db[uniprot_id] = metadata
cur_db.close()

The data is stored as a dictionary attached to the individual UniProt identifiers. This is modeled after the boto SimpleDB library, which provides a python interface to storage in Amazon’s SimpleDB.

All of these code snippets in action can be found in this example script, which helps place the code sections above in context. In future weeks, we will try and pull some interesting information from the protein domain families.

Written by Brad Chapman

January 10, 2009 at 4:22 pm

Posted in information retrieval

Tagged with bioinformatics, python, semanticweb

4 Responses

Subscribe to comments with RSS.

Wouldn’t it be simpler and faster to obtain the data directly from UniProt?

http://www.uniprot.org/uniprot/?query=IPR000953+taxonomy:33208

(Follow the download link to get a link for retrieving the matching entries in whatever format directly.)

Eric Jain

March 6, 2009 at 6:38 pm

Reply
- Eric;
  Agreed completely; the UniProt download pages are very useful. Here I was going for more of a Web Services style approach as opposed to the download/off-line process way. To give a bit of history, I was originally intending to parse the RDF when writing this, but my RDF skills weren’t quite up to par so I fell back on the XML.
  
  Thanks again for pointing out the download utilities,
  Brad
  
  Brad Chapman
  
  March 7, 2009 at 12:30 pm
  
  Reply
  - The download function is not just for offline processing, you can use it to stream a set of entries and process them on the fly as well.
    
    If for some reason (e.g. ease or processing) you do want to retrieve entries one by one rather than in one large stream, the fastest and simplest way to get a list of accession numbers for Metazoa entries associated with a given InterPro identifier is like so:
    
    http://www.uniprot.org/uniprot/?query=IPR000953+taxonomy:33208&format=list
    
    Eric Jain
    
    March 7, 2009 at 1:19 pm
    
    Reply
    - Eric;
      Awesome — thanks for the UniProt tip. I hadn’t realized that and it would have come in very handy. It should be quite useful to anyone who stumbles across this post.
      
      Brad Chapman
      
      March 8, 2009 at 11:08 am
      
      Reply

Blue Collar Bioinformatics