Python query interface to BioGateway SPARQL endpoint and InterMine
I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meetup of open source coders organized by Toshiaki Katayama and the great folks at the Database Center for Life Science (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.
The technical focus was on RDF (Resource Description Framework), a standard way to represent data as triples. Each triple is made up of a subject defining the item, a predicate specifying a relationship, and an object representing the linked data. By providing data in RDF, along with shared naming schemes for common objects, we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.
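To make the triple model concrete, here is a minimal sketch in plain Python with no RDF library involved. The identifiers and predicate names are made up for illustration and do not come from any real vocabulary; the point is only to show how shared identifiers let one triple lead to another.

```python
# Each triple is (subject, predicate, object). All names below are
# illustrative placeholders, not terms from a real ontology.
triples = [
    ("protein:P1", "has_name", "Example kinase"),
    ("protein:P1", "annotated_with", "go:term_A"),
    ("protein:P2", "annotated_with", "go:term_A"),
    ("go:term_A", "has_label", "response to insulin"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Follow the links: find the term labelled "response to insulin",
# then every protein annotated with that term.
terms = [s for (s, _, _) in match(triples, predicate="has_label",
                                  obj="response to insulin")]
proteins = [s for term in terms
            for (s, _, _) in match(triples, predicate="annotated_with", obj=term)]
print(proteins)
```

A real triple store does the same kind of pattern matching, but over shared URIs so that data from different providers can be joined.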
RDF is managed in data stores like Virtuoso, which play a role analogous to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy-to-use Python library to query RDF stores through SPARQL.
The Python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface simplifies the process of making a query to two steps:
- Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
- Pass the query builder to a retrieval server, which executes the query and returns the results in a tabular format, as a NumPy record array.
The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.
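The two steps above can be sketched with a toy builder and server. This is not the library's code: `ToyQueryBuilder` and `ToyServer` are hypothetical names, the "server" is an in-memory list rather than a remote endpoint, and results come back as a list of tuples instead of a NumPy record array.

```python
class ToyQueryBuilder:
    """Collects the attributes to retrieve and the filters to apply."""

    def __init__(self):
        self.attributes = []
        self.filters = {}

    def add_attributes(self, names):
        self.attributes.extend(names)

    def add_filter(self, name, value):
        self.filters[name] = value


class ToyServer:
    """Stands in for a remote data store; here just a list of dicts."""

    def __init__(self, rows):
        self._rows = rows

    def search(self, builder):
        # Keep rows where every filter value occurs in the matching field
        # (case-insensitive substring match, chosen for simplicity).
        hits = [
            row for row in self._rows
            if all(value.lower() in row[name].lower()
                   for name, value in builder.filters.items())
        ]
        # Return one tuple per hit, ordered like the requested attributes.
        return [tuple(row[name] for name in builder.attributes) for row in hits]


rows = [
    {"organism": "Homo sapiens", "gene_name": "GENE_A",
     "description": "insulin signalling"},
    {"organism": "Homo sapiens", "gene_name": "GENE_B",
     "description": "cell cycle"},
]

# Step 1: describe the query. Step 2: hand it to a server.
builder = ToyQueryBuilder()
builder.add_attributes(["gene_name", "description"])
builder.add_filter("description", "insulin")
results = ToyServer(rows).search(builder)
print(results)
```

Because the builder only records what is wanted, the same builder object could in principle be handed to servers backed by very different stores.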
Two interfaces were implemented at BioHackathon 2010:
- BioGateway: a SPARQL endpoint to an RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
- InterMine: an XML based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.
By providing a common interface for both semantic data and more traditional data sources, we hope to facilitate the conversion by data providers to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing resources on developing their biological queries.
BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.
This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:
    from systemsbio import Biogateway, UniProtGOQueryBuilder

    # Filter human proteins by GO term and disease description
    builder = UniProtGOQueryBuilder("Homo sapiens")
    builder.add_filter("GO_term", "insulin")
    builder.add_filter("disease_description", "diabetes")
    builder.add_attributes(["protein_name", "interactor", "gene_name"])

    # Run the query against the remote BioGateway endpoint
    server = Biogateway()
    results = server.search(builder)
    print(len(results), results.dtype.names)

    # Inspect the first matching record
    result = results[0]
    print(result['protein_name'], result['gene_name'],
          result['interactor'], result['GO_term'])
An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:
    from systemsbio import Biogateway, ReferenceBuilder

    builder = ReferenceBuilder()
    builder.add_filter("protein_id", "1433B_HUMAN")
    builder.add_attributes(["reference"])

    server = Biogateway()
    results = server.search(builder)
    print(len(results), results.dtype.names)

    # Take the first reference returned for this protein
    result = results[0]
    print(result['protein_id'], result['reference'])
    from Bio import Entrez

    # Look up the PubMed summary for the reference retrieved above
    Entrez.email = "firstname.lastname@example.org"
    pubmed_id = result['reference'].replace("PMID_", "")
    handle = Entrez.esummary(db="pubmed", id=pubmed_id)
    # esummary returns a list of records; take the first
    record = Entrez.read(handle)[0]
    print(record['Title'])
    print(record['PubDate'])
    print(",".join(record['AuthorList']))
    print(record['FullJournalName'], record['Volume'], record['Pages'])
    # Novel raf kinase protein-protein interactions found by an exhaustive yeast two-hybrid analysis.
    # 2003 Feb
    # Yuryev A,Wennogle LP
    # Genomics 81 112-25
The full source code is available from GitHub: systemsbio.py. The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLWrapper to interact with the remote server and parse the results.
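As a rough illustration of that translation step, the sketch below turns attribute names and filters into a SPARQL query string. It is not taken from systemsbio.py: the `build_sparql` function, the `ex:` prefix URI, and the one-predicate-per-attribute layout are all assumptions for this example, and the real BioGateway schema is more involved.

```python
def build_sparql(attributes, filters, prefix="http://example.org/terms#"):
    """Assemble a SELECT query from attribute names and regex filters.

    The prefix and the predicate naming are hypothetical; string
    escaping is minimal and only handles double quotes.
    """
    # Filtered fields must be bound too, so add any not already requested.
    names = list(attributes)
    for name in filters:
        if name not in names:
            names.append(name)

    select = " ".join("?" + n for n in names)
    # One triple pattern per field, all hanging off the same subject.
    patterns = ["  ?item ex:{0} ?{0} .".format(n) for n in names]
    # One case-insensitive regex constraint per filter.
    constraints = [
        '  FILTER regex(?{0}, "{1}", "i")'.format(n, v.replace('"', '\\"'))
        for n, v in filters.items()
    ]
    lines = (["PREFIX ex: <{0}>".format(prefix),
              "SELECT {0}".format(select),
              "WHERE {"]
             + patterns + constraints + ["}"])
    return "\n".join(lines)


query = build_sparql(["protein_name", "gene_name"], {"GO_term": "insulin"})
print(query)
```

A string like this is what would be handed to a SPARQL client for submission to the endpoint; keeping its construction in one function means users never have to write SPARQL themselves.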
InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:
- Templates: Pre-defined queries that capture common ways biologists search the database.
- Query builder: A graphical interface to define custom queries, allowing manual discovery of attributes of interest.
We access InterMine programmatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:
    from intermine import Intermine, SubmissionQueryBuilder

    builder = SubmissionQueryBuilder()
    builder.add_attributes(["submission_id",
                            "submission_title", "developmental_stage"])
    builder.add_filter("organism", "Caenorhabditis elegans")
    builder.add_filter("antibody_name", "H3K4me3")

    server = Intermine("http://intermine.modencode.org")
    table = server.search(builder)
    print(table.dtype.names)
    print(table)
    # ('submission_id', 'submission_title', 'developmental_stage')
    # [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
    #  ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
    #   'Early Stage Embryos')
    #  ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
    #   'Early Stage Embryos')]
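Since InterMine is queried through XML rather than SPARQL, the builder has to be serialized differently for this backend. The sketch below shows one way that could look using only the standard library. The `build_query_xml` function, the model name, and the path names such as `Submission.title` are hypothetical placeholders chosen for illustration, not the actual modMine data model or the code in the intermine module.

```python
import xml.etree.ElementTree as ET


def build_query_xml(model, view_paths, constraints):
    """Serialize a query as XML: one <query> element listing the output
    columns in `view`, with one <constraint> child per filter."""
    query = ET.Element("query", {"model": model, "view": " ".join(view_paths)})
    for path, value in constraints:
        ET.SubElement(query, "constraint",
                      {"path": path, "op": "=", "value": value})
    return ET.tostring(query, encoding="unicode")


# Hypothetical paths; a real query would use names from the target mine.
xml_text = build_query_xml(
    "example_model",
    ["Submission.id", "Submission.title"],
    [("Submission.organism", "Caenorhabditis elegans")],
)
print(xml_text)
```

The user-facing attribute names (such as `submission_title`) would be mapped to backend paths inside the builder, which is what keeps the calling code identical across data sources.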
An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:
    from intermine import Intermine, ExperimentQueryBuilder

    builder = ExperimentQueryBuilder()
    builder.add_attributes(["submission_id", "experiment_name"])
    builder.add_filter("organism", "Caenorhabditis elegans")
    # Match the term against several description fields at once
    builder.free_text_filter("ChIP-seq")

    server = Intermine("http://intermine.modencode.org")
    table = server.search(builder)
    print(table.dtype.names)
    print(table)
    # ('submission_id', 'experiment_name')
    # [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
    #  ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
    #  ...
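The idea behind a free text filter can be sketched as a single search term checked against several fields, with a record matching if any field contains it. The function and field names below are hypothetical, and the matching runs locally on a small list; the real implementation expresses the same logic as constraints in the query sent to the server.

```python
def free_text_match(record, term,
                    fields=("title", "description", "experiment_type")):
    """True if `term` occurs (case-insensitively) in any listed field.

    Fields missing from the record are treated as empty strings.
    """
    needle = term.lower()
    return any(needle in record.get(field, "").lower() for field in fields)


# Made-up records for illustration only.
records = [
    {"submission_id": "1", "title": "ChIP-Seq of factor X",
     "description": ""},
    {"submission_id": "2", "title": "RNA profiling",
     "description": "control for chip-seq run"},
    {"submission_id": "3", "title": "RNA profiling",
     "description": "time course"},
]

hits = [r["submission_id"] for r in records if free_text_match(r, "ChIP-seq")]
print(hits)
```

The caller supplies one term and never needs to know which fields exist, which is the detail the query builder hides.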
It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day to day summaries:
You are invited to fork and extend the code on GitHub.