Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Posts Tagged ‘semanticweb

Extracting protein characteristics for an InterPro domain

with 4 comments

The InterPro database provides a common language and organization for characterized protein domains. Here, the goal is to extract all proteins containing a domain of interest, limit these proteins to those in the animal (Metazoa) kingdom, and extract information about the resulting proteins. Protein information will be retrieved from the UniProt database.

The first step is identifying the proteins of interest with a given domain. An example is the chromo domain motif with InterPro identifier IPR000953. InterPro provides a useful REST interface, allowing us to download the full list of related proteins in FASTA format and parse them using the Biopython SeqIO interface. For our example, this provides over 2500 records; here is a snippet of the class function that does this work.

def get_interpro_records(self, ipr_number):
    url_base = "%s/interpro/ISpy?ipr=%s&mode=fasta"
    full_url = url_base % (self._server, ipr_number)
    recs = []
    with self._get_open_handle(full_url) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
            recs.append(rec)
    return recs

UniProt provides an excellent REST interface and XML format which help simplify the retrieval process. Parsing the XML records with the python ElementTree parser allows us to quickly pull out the organism name and evolutionary lineage.

import xml.etree.ElementTree as ET

def get_xml_metadata(self, uniprot_id):
    url_base = "%s/uniprot/%s.xml"
    full_url = url_base % (self._server, uniprot_id)
    metadata = {}
    with self._get_open_handle(full_url) as in_handle:
        root = ET.parse(in_handle).getroot()
        org = root.find("%sentry/%sorganism" % (self._xml_ns, self._xml_ns))
        for org_node in org:
            if org_node.tag == "%sname" % self._xml_ns:
                if org_node.attrib["type"] == "scientific":
                    metadata["org_scientific_name"] = org_node.text
                elif org_node.attrib["type"] == "common":
                    metadata["org_common_name"] = org_node.text
            elif org_node.tag == "%slineage" % self._xml_ns:
                metadata["org_lineage"] = [n.text for n in org_node]
    return metadata

Putting all the parts together, we loop over the list of Biopython sequence record objects, extract the UniProt ID, retrieve the metadata from UniProt, and store this in a local python shelve database:

cache_dir = os.path.join(os.getcwd(), "cache")
db_dir = os.path.join(os.getcwd(), "db")
interpro_retriever = InterproRestRetrieval(cache_dir)
uniprot_retriever = UniprotRestRetrieval(cache_dir)
cur_db = shelve.open(os.path.join(db_dir, ipr_number))
seq_recs = interpro_retriever.get_interpro_records(ipr_number)
for seq_rec in seq_recs:
    uniprot_id = seq_rec.id.split("|")[0]
    metadata = uniprot_retriever.get_xml_metadata(uniprot_id)
    if (metadata.has_key("org_lineage") and 
            "Metazoa" in metadata["org_lineage"]):
        metadata["seq"] = seq_rec.seq.data
        cur_db[uniprot_id] = metadata
cur_db.close()

The data is stored as a dictionary attached to the individual UniProt identifiers. This is modeled after the boto SimpleDB library, which provides a python interface to storage in Amazon’s SimpleDB.

All of these code snippets in action can be found in this example script, which helps place the code sections above in context. In future weeks, we will try and pull some interesting information from the protein domain families.

Written by Brad Chapman

January 10, 2009 at 4:22 pm

Extracting keywords from biological text using Zemanta

with 7 comments

Increasingly, my daily work is shifting from a model of “let me do this analysis for you and give you back some data” to “let me provide an interface that allows you to do the analyses yourself.” This is great as it allows more close collaboration with lab scientists, and also helps split up work so I can be involved in more projects. One common interface suggestion is a keyword google-like search box; enter some text and find anything related to this term. In implementing this, I wanted to provide reasonable search suggestions by identifying keywords from gene descriptions. These can help frame researchers questions, and prove clues about useful search terms for new users.

Here is an implementation of keyword extraction from biological text using the Zemanta semantic API. The function uses the Zemanta REST interface and parses JSON output with simplejson. The parsed JSON is available as a python dictionary.

def query_zemanta(search_text):
    gateway = 'http://api.zemanta.com/services/rest/0.0/'
    args = {'method': 'zemanta.suggest',
            'api_key': 'YOUR_API_KEY',
            'text': search_text,
            'return_categories': 'dmoz',
            'return_images': 0,
            'return_rdf_links' : 1,
            'format': 'json'}
    args_enc = urllib.urlencode(args)

    raw_output = urllib2.urlopen(gateway, args_enc).read()
    output = simplejson.loads(raw_output)

    print 'First article:', output['articles'][0]
    print 'Keywords:', [k['name'] for k in output['keywords']]
    for link in output['markup']['links']:
        print link['anchor']
        for target in link['target']:
            if target['type'] in ['wikipedia', 'rdf']:
                print '\t', target['title'], target['url']
    #print output
    print 'Marked up text', output['markup']['text']

In addition to extracting keywords, Zemanta also provides links to online resources. Here is the keyword list:

Keywords: [u'Drosophila', u'Biology', u'Messenger RNA', u'Germ cell',
           u'Mitogen-activated protein kinase', u'Oocyte', 
           u'C-Jun N-terminal kinases', u'RNA']

And here is the original testing text marked up with the automated links:

glh-2 encodes a putative DEAD-box RNA helicase that contains six CCHC zinc fingers and is homologous to Drosophila VASA, a germ-line-specific, ATP-dependent, RNA helicase; GLH-2 activity may also be required for the wild-type morphology of P granules and for localization of several protein components, but not accumulation of P granule mRNA components; GLH-2 interacts in vitro with itself and with KGB-1, a JNK-like MAP kinase; GLH-2 is a constitutive P granule component and thus, with the exception of mature sperm, is expressed in germ cells at all stages of development; GLH-2 is cytoplasmic in oocytes and the early embryo, while perinuclear in all later developmental stages as well as in the distal and medial regions of the hermaphrodite gonad; GLH-2 is expressed at barely detectable levels in males.

In addition to the keywords, Zemanta does an excellent job of automatically annotating the text with links to relevant resources:

  • Most impressively, the JNK acronym is determined to reference C-Jun N-terminal kinases, providing a link to the Wikipedia reference.
  • Zemanta also provides links to Freebase, an open database in RDF-ready format. One example is this useful link to Drosophila from which you could automatically extract NCBI Taxon IDs.
  • The one automated semantic mistake is the link to KGB; it provides a link to the KGB headquarters in Russia.

Zemanta also supports links to NCBI in their latest release (as a funny semantic miscue, the blog link to NCBI is to the wrong NCBI). I did not get any NCBI links with this few example, but it is exciting to see they are thinking about scientific applications.

In addition to Zemanta, I also tested OpenCalais with the python interface, which was not as successful. For the same text above, it returned only a single keyword: the incorrect KGB one. It appears as if their current focus is in finance, but it is worth watching for future developments.

Written by Brad Chapman

January 4, 2009 at 1:15 am

Posted in semanticweb

Tagged with , ,

Comparative genomics information retrieval from Ensembl

leave a comment »

The Ensembl website provides a powerful front end to genomic data from a wide variety of eukaryotic species. Additionally, the Comparative Genomics initiative provides automated phylogenetic analyses for comprehensive examination of gene families. This post describes retrieving comparative protein details for a human gene of interest using the tried and true method of screen scraping, presenting the data in a method ready to be presented via a REST interface.

Several other people have had similar notions for retrieving Ensembl data. Pedro describes an example using openKapow, and Andrew uses Dapper.

Here, we deal with the Ensembl web pages using Beautiful Soup, a Python web parser that simplifies the work of web page retrieval. The idea is to generate an interface that could be readily abstracted to a set of REST web pages. This would greatly simplify information retrieval from Ensembl; retrieving a set of orthologs for a gene ID would involve a workflow like:

  • Prepare a URL like:
    http://example.ensembl.org/REST/Compara_Ortholog/Homo_sapiens/ENSG00000173894
  • Parse out a simple text result in CSV:
    Cavia_porcellus,ENSCPOG00000005599
    Danio_rerio,ENSDARG00000044938
    Dasypus_novemcinctus,ENSDNOG00000002668
    

For queries that can be expressed with a few inputs and readily understandable outputs, this would provide programmatic access to Ensembl data without the overhead of installing the Perl API. Below is a function which retrieves the pages, parses them with Beautiful Soup, and returns the simplified information. To wrap this into a REST interface described above, would require adding a layer on top using a Python web framework like Pylons.

def orthologs(self, organism, gene_id):
    """Retrieve a list of orthologs for the given gene ID.
    """
    orthologs = []
    with self._get_open_handle("Gene", "Compara_Ortholog",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        orth_table = soup.find("table", "orthologues")
        orth_links = orth_table.findAll("a", 
                href = re.compile("Gene/Summary"))
        for orth_link in orth_links:
            href_parts = [x for x in orth_link['href'].split('/') if x]
            orthologs.append((href_parts[0], orth_link.string))
    return orthologs

The full example script takes an organism and gene ID as input and displays homologs in Ensembl along with distance from the initial organism, protein domains, and features of the protein. The script uses Biopython, the Newick Tree Parser to parse phylogenetic trees, and NetworkX to calculate phylogenetic distances from the tree.

> python2.5 ensembl_remote_rest.py Homo_sapiens ENSG00000173894
Homo_sapiens 0 [u'IPR016197', u'IPR000953'] 38.0
Pan_troglodytes 0.009 [u'IPR016197', u'IPR000637', u'IPR000953'] 38.0
Gorilla_gorilla 0.0169 [u'IPR016197', u'IPR000637', u'IPR000953'] 
Macaca_mulatta 0.0538 [u'IPR000637', u'IPR000953'] 36.0
Tarsius_syrichta 0.1622 [u'IPR000637'] 
Microcebus_murinus 0.1848 [u'IPR000637', u'IPR000953'] 33.5

This demonstrates using the framework to look at the change in domains and protein charge across a gene family. The general idea of the EnsemblComparaRest< class could be applied to other information of interest available from Ensembl web pages.

Written by Brad Chapman

December 28, 2008 at 10:14 pm

Parsing GoPubMed RDF with Sparta

leave a comment »

GoPubMed provides a semantic layer on top of PubMed articles, linking them to GO and other ontologies and providing this metadata as RDF.

Starting out with RDF can be a bit hairy when dealing with the namespaces that prefix many of the URIs. Sparta is a python library that sits on top of rdflib providing nice shortcuts for accessing terms. For instance, if you have the following RDF namespace and term:

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/">
</rdf:RDF>
<dc:title>Example</dc:title>

Instead of having to access this attribute with http://purl.org/dc/elements/1.1/title in rdflib, sparta allows you to do so with the shortcut dc_title. Here is a short example with GoPubMed which prints out a PubMed record and its title:

import urllib
from rdflib import ConjunctiveGraph as Graph
import sparta

url = 'http://www.gopubmed.org/GoMeshPubMed/gomeshpubmed/Search/' + \
      'RDF?q=18463287&amp;type=RdfExportAll'
gopubmed_handle = urllib.urlopen(url)
graph = Graph()
graph.parse(gopubmed_handle)
gopubmed_handle.close()

graph_subjects = list(set(graph.subjects()))
sparta_factory = sparta.ThingFactory(graph)
for subject in graph_subjects:
    sparta_graph = sparta_factory(subject)
    print subject, [unicode(t) for t in sparta_graph.dc_title][0]

The Sparta library had not yet been updated to work with rdflib version 2.4, and also needed a few smaller fixes in matching URIs to attribute names. These are fixed in an updated version available here.

Written by Brad Chapman

December 21, 2008 at 1:38 am

Posted in semanticweb

Tagged with , ,

Standard Ontologies in BioSQL

leave a comment »

A recent thread by Peter on the BioSQL mailing list initiated some thinking about formalizing ontologies and terms in BioSQL. The current ad-hoc solution is that BioPerl, Biopython and BioJava attempt to use the same naming schemes. The worry is that this is not documented, no one is likely in a big hurry to document it, and we are essentially inventing another ontology.

The BioSQL methodology of storing key/value pair information on items can be mapped to RDF triples as:

BioSQL RDF
Bioentry or Feature Subject
Ontology Namespace of predicate
Term Predicate term, relative to namespace
Value Object

Thus, a nice place to look for ontologies is in standards intended for RDF. Greg Tyrelle thought this same way a while ago and came up with a XSLT to transform GenBank XML to RDF, using primarily the Dublin Core vocabulary. On the biology side, the Sequence Ontology project provides an ontology meant for describing biological sequences. This includes a mapping to GenBank feature table names.

Using these as a starting point, I generated a mapping of GenBank names to names in the Dublin Core and SO ontologies. This is meant as a basis for standardizing and documenting naming in BioSQL. The mapping file thus far covers almost all of the header and feature keys, and more than half of the qualifier keys:

I would welcome suggestions for missing GenBank terms, as well as corrections on the terms mapped by hand.

Some notes on the mapping:

  • Cross references to other identifiers are mapped with the Dublin Core term ‘relation’. These can occur in many places in the GenBank format. Using a single term allows them to be flattened, with mapping values in form of ‘database:identifier.’ This is consistent with the GenBank /db_xref qualifier.
  • Multiple names or descriptions of an item, also stored in multiple places in GenBank files, receive the Dublin Core term ‘alternative.’
  • Organism and taxonomy ontologies are a whole project onto themselves, so I didn’t try to tackle them here.

Some other useful links for biological ontology mapping:

Written by Brad Chapman

December 14, 2008 at 9:40 pm