Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Archive for the ‘information retrieval’ Category

Python query interface to BioGateway SPARQL endpoint and InterMine

with one comment

I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meet up of open source coders organized by Toshiaki Katayama and the great folks at the Database Center of Life Sciences (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.

The technical focus was on RDF (Resource Description Format), a standard way to represent data as a triple; each triple is made up of a subject defining the item, a predicate specifying a relationship, and a object representing the linked data. By providing data in RDF along with common naming schemes for common objects we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.

RDF is managed in data stores like Virtuoso, which are equivalent to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy to use Python library to query RDF stores through SPARQL.

Interface design

The python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface simplifies the process of making a query to two steps:

  • Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
  • Pass the query builder to a retrieval server which executes the query and returns the results in a tabular format, as a numpy RecordArray.

The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.

Two interfaces were implemented at BioHackathon 2010:

  • BioGateway — a SPARQL endpoint to a RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
  • InterMine — a XML based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.

By providing a common interface for both semantic data and more traditional data sources, we hope to facilitate the conversion by data providers to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing resources on developing their biological queries.


BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.

This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:

from systemsbio import Biogateway, UniProtGOQueryBuilder
builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
builder.add_attributes(["protein_name", "interactor", "gene_name"])
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_name'], result['gene_name'], \
      result['interactor'], result['GO_term']

An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:

from systemsbio import Biogateway, ReferenceBuilder
builder = ReferenceBuilder()
builder.add_filter("protein_id", "1433B_HUMAN")
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_id'], result['reference']

Using the associated PubMed IDs, we can retrieve the paper details using Biopython and NCBI Entrez, utilizing BioGateway as part of an automated analysis pipeline:

from Bio import Entrez = ""
pubmed_id = result['reference'].replace("PMID_", "")
handle = Entrez.esummary(db="pubmed", id=pubmed_id)
record =[0]
print record['Title']
print record['PubDate']
print ",".join(record['AuthorList'])
print record['FullJournalName'], record['Volume'], record['Pages']
# Novel raf kinase protein-protein interactions found by an exhaustive yeast two-hybrid analysis.
# 2003 Feb
# Yuryev A,Wennogle LP
# Genomics 81 112-25

The full source code is available from GitHub: The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLwrapper to interact with the remote server and parse the results.


InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:

  • Templates — Pre-defined queries that capture common ways biologists search the database.

  • Query builder — A graphical interface to define custom queries, allowing manual discovery of attributes of interest.

We access InterMine programatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4Me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:

from intermine import Intermine, SubmissionQueryBuilder
builder = SubmissionQueryBuilder()
    "submission_title", "developmental_stage"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.add_filter("antibody_name", "H3K4me3")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'submission_title', 'developmental_stage')
# [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
# ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
# 'Early Stage Embryos')
# ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
# 'Early Stage Embryos')]

An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:

from intermine import Intermine, ExperimentQueryBuilder
builder = ExperimentQueryBuilder()
builder.add_attributes(["submission_id", "experiment_name"])
builder.add_filter("organism", "Caenorhabditis elegans")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'experiment_name')
# [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ...

The full source code is available from GitHub: It builds a XML query for submission to InterMine Web Services, handling submitting and parsing the result.


It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day to day summaries:

You are invited to fork and extend the code on GitHub.

Written by Brad Chapman

February 15, 2010 at 11:33 am

Automated retrieval of expression data with Python and R

with 10 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = ""
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """ = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results =
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary =
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    robjects.r.assign("", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
      gsm <- getGEO(
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    robjects.r.assign("", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
      gpl <- getGEO( <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab') # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Location and duplication information from Ensembl

with 2 comments

Understanding the evolutionary history of a gene family may provide some insight into the mechanisms of its activity. Here we will look at two characteristics for a list of eukaryotic genes of interest:

  • Groups that are co-located on a chromosome.
  • Groups evolutionarily related through duplication (paralogs).

By expanding on work done in an earlier post, we will identify co-located and duplicated genes using information from the Ensembl genome browser.

Using our previous method of screen scraping with Beautiful Soup, we extend our Ensembl REST-like class to include two new functions. Location is represented as a list of chromosome name, start and end, and parsed from the Gene Summary page:

def location(self, organism, gene_id):
    with self._get_open_handle("Gene", "Summary",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        loc_tab = soup.find("dd", id="tab_location")
        link = loc_tab.find("a")
        path, attrs = urllib2.splitattr(link["href"])
        for attr in attrs:
            if attr.find("r=") == 0:
                key, val = attr.split("=")
                chrom, location = val.split(":")
                start, end = location.split("-")
                return chrom, int(start), int(end)
        raise ValueError("Did not find location: %s" % link)

Similarly, we can retrieve the details on paralogs from the Ensembl Comparative pages for the genes. With these two functions, a list of the protein IDs of interest, and a dictionary containing species and Ensembl ID references, we collect all of the location and duplication information into python dictionaries. At the same time we maintain a backwards mapping of Ensembl IDs to our original ID:

ensembl_retriever = EnsemblComparaRest(cache_dir)
loc_info = dict()
dup_info = dict()
ensembl_to_uniprot = dict()
for cur_id in all_ids:
    cur_rec = db[cur_id]
    cur_ensembl_org = cur_rec["org_scientific_name"].replace(" ", "_")
    for ensembl_id in cur_rec.get("db_refs_ensembl", []):
        paralogs = ensembl_retriever.paralogs(cur_ensembl_org, ensembl_id)
        chromosome, start, end = ensembl_retriever.location(
                cur_ensembl_org, ensembl_id)
        dup_info[ensembl_id] = paralogs
        loc_info[ensembl_id] = (cur_rec["org_scientific_name"],
            chromosome, start, end)
        ensembl_to_uniprot[ensembl_id] = cur_id

Now we want to flatten the paralog dictionary into a list of groups associated by duplication. This is done by using python sets; any groups with shared genes are combined and the resulting unique list is returned:

def examine_paralogs(dup_info, ensembl_to_uniprot):
    cur_groups = []
    all_base = dup_info.keys()
    for base_id, dup_ids in dup_info.items():
        overlap = set(dup_ids) & set(all_base)
        if len(overlap) > 0:
            new_group = set([ensembl_to_uniprot[x] for x in overlap |
            is_unique = True
            for exist_i, exist_group in enumerate(cur_groups):
                if len(new_group & exist_group) > 0:
                    update_group = new_group & exist_group
                    cur_groups[exist_i] = update_group
                    is_unique = False
            if is_unique:
    return [list(g) for g in cur_groups]

We use two functions to similarly group together genes by location. The first function calculates the actual distance given the previously retrieved location information list of (organism, chromosome, start, end). Two items in different organisms or on different chromosomes are placed maximally far apart since they can’t be co-located.

def location_distance(loc_one, loc_two):
    if loc_one[:2] != loc_two[:2] or loc_one == loc_two:
        return sys.maxint
        return max(abs(loc_one[3] - loc_two[2]),
                   abs(loc_two[3] - loc_one[2]))

The next function creates a dictionary of genes co-located based on our threshold, and then uses the examine_paralogs flattening function we defined above to generate groups from this dictionary. The variable distance_thresh defines the distance in which two genes are considered co-located. For the example script, 1 megabase is used, but this can be adjusted according to your own personal definition of close.

def examine_location(loc_info, ensembl_to_uniprot, distance_thresh):
loc_close = collections.defaultdict(lambda: [])
for uniprot_id, loc_one in loc_info.items():
for cmp_id, loc_two in loc_info.items():
if location_distance(loc_one, loc_two) <= distance_thresh: loc_close[uniprot_id].append(cmp_id) return examine_paralogs(loc_close, ensembl_to_uniprot) [/sourcecode]

The full script takes a file input with each line being a uniprot ID and gene name, and also requires a UniProt shelve database like we developed earlier. This shelve database provides the base ID to ensembl ID mappings and organism information, which can be parsed from UniProt XML files.

The resulting co-location and duplication groups are sets of genes which may share an intriguing evolutionary history. The paralogs have been computationally determined to be evolutionarily related based on sequence similarity and the composition of their gene family tree. The co-located genes may be co-selected during evolution, be the result of localized gene duplication obscured by extensive sequence change, or simply be close together based on chance. Further examination of your gene family in light of this information can help determine which of these hypotheses to favor.

Written by Brad Chapman

January 31, 2009 at 3:11 pm

Extracting protein characteristics for an InterPro domain

with 4 comments

The InterPro database provides a common language and organization for characterized protein domains. Here, the goal is to extract all proteins containing a domain of interest, limit these proteins to those in the animal (Metazoa) kingdom, and extract information about the resulting proteins. Protein information will be retrieved from the UniProt database.

The first step is identifying the proteins of interest with a given domain. An example is the chromo domain motif with InterPro identifier IPR000953. InterPro provides a useful REST interface, allowing us to download the full list of related proteins in FASTA format and parse them using the Biopython SeqIO interface. For our example, this provides over 2500 records; here is a snippet of the class function that does this work.

def get_interpro_records(self, ipr_number):
    url_base = "%s/interpro/ISpy?ipr=%s&mode=fasta"
    full_url = url_base % (self._server, ipr_number)
    recs = []
    with self._get_open_handle(full_url) as in_handle:
        for rec in SeqIO.parse(in_handle, "fasta"):
    return recs

UniProt provides an excellent REST interface and XML format which help simplify the retrieval process. Parsing the XML records with the python ElementTree parser allows us to quickly pull out the organism name and evolutionary lineage.

import xml.etree.ElementTree as ET

def get_xml_metadata(self, uniprot_id):
    url_base = "%s/uniprot/%s.xml"
    full_url = url_base % (self._server, uniprot_id)
    metadata = {}
    with self._get_open_handle(full_url) as in_handle:
        root = ET.parse(in_handle).getroot()
        org = root.find("%sentry/%sorganism" % (self._xml_ns, self._xml_ns))
        for org_node in org:
            if org_node.tag == "%sname" % self._xml_ns:
                if org_node.attrib["type"] == "scientific":
                    metadata["org_scientific_name"] = org_node.text
                elif org_node.attrib["type"] == "common":
                    metadata["org_common_name"] = org_node.text
            elif org_node.tag == "%slineage" % self._xml_ns:
                metadata["org_lineage"] = [n.text for n in org_node]
    return metadata

Putting all the parts together, we loop over the list of Biopython sequence record objects, extract the UniProt ID, retrieve the metadata from UniProt, and store this in a local python shelve database:

cache_dir = os.path.join(os.getcwd(), "cache")
db_dir = os.path.join(os.getcwd(), "db")
interpro_retriever = InterproRestRetrieval(cache_dir)
uniprot_retriever = UniprotRestRetrieval(cache_dir)
cur_db =, ipr_number))
seq_recs = interpro_retriever.get_interpro_records(ipr_number)
for seq_rec in seq_recs:
    uniprot_id ="|")[0]
    metadata = uniprot_retriever.get_xml_metadata(uniprot_id)
    if (metadata.has_key("org_lineage") and 
            "Metazoa" in metadata["org_lineage"]):
        metadata["seq"] =
        cur_db[uniprot_id] = metadata

The data is stored as a dictionary attached to the individual UniProt identifiers. This is modeled after the boto SimpleDB library, which provides a python interface to storage in Amazon’s SimpleDB.

All of these code snippets in action can be found in this example script, which helps place the code sections above in context. In future weeks, we will try and pull some interesting information from the protein domain families.

Written by Brad Chapman

January 10, 2009 at 4:22 pm

Comparative genomics information retrieval from Ensembl

leave a comment »

The Ensembl website provides a powerful front end to genomic data from a wide variety of eukaryotic species. Additionally, the Comparative Genomics initiative provides automated phylogenetic analyses for comprehensive examination of gene families. This post describes retrieving comparative protein details for a human gene of interest using the tried and true method of screen scraping, presenting the data in a method ready to be presented via a REST interface.

Several other people have had similar notions for retrieving Ensembl data. Pedro describes an example using openKapow, and Andrew uses Dapper.

Here, we deal with the Ensembl web pages using Beautiful Soup, a Python web parser that simplifies the work of web page retrieval. The idea is to generate an interface that could be readily abstracted to a set of REST web pages. This would greatly simplify information retrieval from Ensembl; retrieving a set of orthologs for a gene ID would involve a workflow like:

  • Prepare a URL like:
  • Parse out a simple text result in CSV:

For queries that can be expressed with a few inputs and readily understandable outputs, this would provide programmatic access to Ensembl data without the overhead of installing the Perl API. Below is a function which retrieves the pages, parses them with Beautiful Soup, and returns the simplified information. To wrap this into a REST interface described above, would require adding a layer on top using a Python web framework like Pylons.

def orthologs(self, organism, gene_id):
    """Retrieve a list of orthologs for the given gene ID.
    orthologs = []
    with self._get_open_handle("Gene", "Compara_Ortholog",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        orth_table = soup.find("table", "orthologues")
        orth_links = orth_table.findAll("a", 
                href = re.compile("Gene/Summary"))
        for orth_link in orth_links:
            href_parts = [x for x in orth_link['href'].split('/') if x]
            orthologs.append((href_parts[0], orth_link.string))
    return orthologs

The full example script takes an organism and gene ID as input and displays homologs in Ensembl along with distance from the initial organism, protein domains, and features of the protein. The script uses Biopython, the Newick Tree Parser to parse phylogenetic trees, and NetworkX to calculate phylogenetic distances from the tree.

> python2.5 Homo_sapiens ENSG00000173894
Homo_sapiens 0 [u'IPR016197', u'IPR000953'] 38.0
Pan_troglodytes 0.009 [u'IPR016197', u'IPR000637', u'IPR000953'] 38.0
Gorilla_gorilla 0.0169 [u'IPR016197', u'IPR000637', u'IPR000953'] 
Macaca_mulatta 0.0538 [u'IPR000637', u'IPR000953'] 36.0
Tarsius_syrichta 0.1622 [u'IPR000637'] 
Microcebus_murinus 0.1848 [u'IPR000637', u'IPR000953'] 33.5

This demonstrates using the framework to look at the change in domains and protein charge across a gene family. The general idea of the EnsemblComparaRest< class could be applied to other information of interest available from Ensembl web pages.

Written by Brad Chapman

December 28, 2008 at 10:14 pm