Automated retrieval of expression data with Python and R
In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.
In an ideal semantic world, you would be able to build a query through a set of RDF triples like:
- Expression set -> has organism -> Mus musculus
- Mus musculus -> has cell type -> proB
- proB -> has biological state -> wild type
- Expression set -> has identifier -> A RefGene ID
- A RefGene ID -> has expression value -> 7.65
So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.
Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.
At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:
def main():
organism = "Mus musculus"
cell_types = ["proB", "ProB", "pro-B"]
email = "chapmanb@50mail.com"
save_dir = os.getcwd()
exp_data = get_geo_data(organism, cell_types, email, save_dir,
_is_wild_type)
def _is_wild_type(result):
"""Check if a sample is wild type from the title.
"""
return result.samples[0][0].startswith("WT")
We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:
def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
if not os.path.exists(save_file):
results = cell_type_gsms(organism, cell_types, email)
for result in results:
if is_desired_result(result):
with open(save_file, "w") as out_handle:
cPickle.dump(result, out_handle)
break
with open(save_file) as save_handle:
result = cPickle.load(save_handle)
The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:
def cell_type_gsms(organism, cell_types, email):
"""Use Entrez to retrieve GEO entries for an organism and cell type.
"""
Entrez.email = email
search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
print "Searching GEO and retrieving results: %s" % search_term
hits = []
handle = Entrez.esearch(db="gds", term=search_term)
results = Entrez.read(handle)
for geo_id in results['IdList']:
handle = Entrez.esummary(db="gds", id=geo_id)
summary = Entrez.read(handle)
samples = []
for sample in summary[0]['Samples']:
for cell_type in cell_types:
if sample['Title'].find(cell_type) >= 0:
samples.append((sample['Title'], sample['Accession']))
break
if len(samples) > 0:
hits.append(GEOResult(summary[0]['summary'], samples))
return hits
Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:
WT ProB, biological rep1
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]),
('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
('NM_177325', [7.3312364040000002, 11.831475960000001]),
('NM_177323', [6.9779868059999997, 6.3926399939999996]),
('NM_177322', [5.0833683379999997])]
The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:
def _write_gsm_map(self, gsm_id, meta_file, table_file):
"""Retrieve GEO expression values using Bioconductor, saving to a table.
"""
robjects.r.assign("gsm.id", gsm_id)
robjects.r.assign("table.file", table_file)
robjects.r.assign("meta.file", meta_file)
robjects.r('''
library(GEOquery)
library(rjson)
gsm <- getGEO(gsm.id)
write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
col.names = TRUE)
cat(toJSON(Meta(gsm)), file = meta.file)
''')
The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:
def _write_gpl_map(self, gpl_id, gpl_file):
"""Retrieve GEO platform data using R and save to a table.
"""
robjects.r.assign("gpl.id", gpl_id)
robjects.r.assign("gpl.file", gpl_file)
robjects.r('''
library(GEOquery)
gpl <- getGEO(gpl.id)
gpl.map <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
col.names = TRUE)
''')
Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:
def get_gsm_tx_values(self, gsm_id, save_dir):
"""Retrieve a map of transcripts to expression from a GEO GSM file.
"""
gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
if (not os.path.exists(gsm_meta_file) or
not os.path.exists(gsm_table_file)):
self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)
with open(gsm_meta_file) as in_handle:
gsm_meta = json.load(in_handle)
id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
tx_to_vals = collections.defaultdict(list)
with open(gsm_table_file) as in_handle:
reader = csv.reader(in_handle, dialect='excel-tab')
reader.next() # header
for probe_id, probe_val in reader:
for tx_id in id_to_tx.get(probe_id, []):
tx_to_vals[tx_id].append(float(probe_val))
return tx_to_vals
The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.
Gene Ontology analysis with Python and Bioconductor
Last post we discussed extracting differentially expressed genes from a set of short read data. The output of this analysis is a large list of genes, and the next challenge is to try and extract interesting patterns from the results. Are genes in particular families over-represented? Are there a large number of genes with a particular function? These patterns can help direct your future experiments and provide an understandable grouping for biologists examining the data.
The Gene Ontology (GO) project provides a standardized set of terms describing the molecular function of genes. We will use the topGO package from the Bioconductor project to identify over-represented GO terms from a set of differentially expressed genes. Python will be used to prepare the data, utilizing rpy2 to call R for the statistical analysis.
The first step is to parse input files describing the differentially expressed genes and the mapping of gene names to GO terms. For the example we will use a simple CSV file from our previous analysis and an equally simple file describing the gene to GO mapping. For real life work, you will need to get the GO mapping file for your organism; for instance, here is the Arabidopsis GO mapping from TAIR.
The input file of differentially expressed genes is parsed into a dictionary where the keys are gene names and the values are associated p-values:
import csv
def parse_input_csv(in_handle):
reader = csv.reader(in_handle)
reader.next() # header
all_genes = dict()
for (gene_name, _, _, pval) in reader:
all_genes[gene_name] = float(pval)
return all_genes
The GO mapping input file is parsed into two dictionaries: one mapping genes to GO terms for the GO analysis, and the second mapping GO terms to genes for displaying the results of the analysis:
import collections
def parse_go_map_file(in_handle, genes_w_pvals):
gene_list = genes_w_pvals.keys()
gene_to_go = collections.defaultdict(list)
go_to_gene = collections.defaultdict(list)
for line in in_handle:
parts = line.split("\t")
gene_id = parts[0]
go_id = parts[1].strip()
if gene_id in gene_list:
gene_to_go[gene_id].append(go_id)
go_to_gene[go_id].append(gene_id)
return dict(gene_to_go), dict(go_to_gene)
Next we set up our associated R environment for GO analysis. This involves loading the topGO library and creating a function to classify genes as either differentially expressed, or not. We do this based on the p-value, classifying genes with a p-value of 0.01 or less as differentially expressed:
import rpy2.robjects as robjects
gene_pval = 1e-2
robjects.r('''
library(topGO)
''')
robjects.r('''
topDiffGenes = function(allScore) {
return (allScore < %s)
}
''' % gene_pval)
Now it’s time to run the analysis. The paramters are defined as a dictionary which initializes a topGOdata object:
- ontology — the type of GO ontology to analyze; molecular function (MF) is examined here.
- annot — how the GO terms are annotated; we specify they will be identified by a mapping of genes to GO ids, passed as the gene2GO argument.
- geneSelectionFun — the function we defined above to determine if genes are differentially expressed.
- allGenes — an R named vector where the names are genes and the values are p-values; we use a utility function to convert a python dictionary to the named vector.
- gene2GO — an R named vector mapping gene names to associated GO terms.
The initialized object is run using Fisher classic analysis. Several different analysis methods and test statistics are available in the package. The resulting scores are extracted and a summary table of the results is generated:
def _dict_to_namedvector(init_dict):
"""Call R to create a named vector from an input dictionary.
"""
return robjects.r.c(**init_dict)
go_term_type = "MF"
topgo_method = "classic" # choice of classic, elim, weight
params = {"ontology" : go_term_type,
"annot" : robjects.r["annFUN.gene2GO"],
"geneSelectionFun" : robjects.r["topDiffGenes"],
"allGenes" : _dict_to_namedvector(gene_vals),
"gene2GO" : _dict_to_namedvector(gene_to_go)
}
go_data = robjects.r.new("topGOdata", **params)
results = robjects.r.runTest(go_data, algorithm=topgo_method,
statistic="fisher")
scores = robjects.r.score(results)
score_names = scores.getnames()
num_summarize = min(100, len(score_names))
results_table = robjects.r.GenTable(go_data, elimFisher=results,
orderBy="elimFisher", topNodes=num_summarize)
The scores and results table are used to generate a list of over-represented GO terms with their associated p-values and descriptions:
# extract term names from the topGO summary dataframe
GO_ID_INDEX = 0
TERM_INDEX = 1
ids_to_terms = dict()
for index, go_id in enumerate(results_table[GO_ID_INDEX]):
ids_to_terms[go_id] = results_table[TERM_INDEX][index]
go_terms = []
# convert the scores and results information info terms to return
for index, item in enumerate(scores):
if item < go_pval:
go_id = score_names[index]
go_terms.append((item, go_id, ids_to_terms.get(go_id, "")))
go_terms.sort()
Finally, we print out the resulting overexpressed GO terms along with associated genes:
def print_go_info(go_terms, go_term_type, go_to_gene):
for final_pval, go_id, go_term in go_terms:
genes = []
for check_go in [go_id] + get_go_children(go_id, go_term_type):
genes.extend(go_to_gene.get(check_go, []))
genes = sorted(list(set(genes)))
print "-> %s (%s) : %0.4f" % (go_id, go_term, final_pval)
for g in genes:
print g
One tricky part of the post-analysis processing is finding all of the genes associated with an identified GO id. The GO terminology is hierarchical, so a resulting GO identifier like GO:0042578 in the example results file can represent both that term and more specific terms. A nice way to illustrate this for yourself is to look at the AmiGO browser display for the term.
The result is that genes associated with an over-represented term will come from both the identified term and more specific terms. Bioconductor has an interface to the GO database, so we can extract all of the relevant terms by moving down the GO tree collecting the more specific children, given the parent term:
def get_go_children(go_term, go_term_type):
robjects.r('''
library(GO.db)
''')
child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
children = []
to_check = [go_term]
while len(to_check) > 0:
new_children = []
for check_term in to_check:
new_children.extend(list(robjects.r.get(check_term, child_map)))
new_children = list(set([c for c in new_children if c]))
children.extend(new_children)
to_check = new_children
children = list(set(children))
return children
The full script pulls all of this code together into a working example. You will need to update the parsers to match your input and GO mapping file, but otherwise the code can plug directly into full genome analyses of differential expression data.
Differential expression analysis with Bioconductor and Python
This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.
Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.
Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.
Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:
import csv
import collections
def read_count_file(in_file):
"""Read count information from a simple CSV file into a dictionary.
"""
counts = collections.defaultdict(dict)
with open(in_file) as in_handle:
reader = csv.reader(in_handle)
header = reader.next()
conditions = header[1:]
for parts in reader:
region_name = parts[0]
region_counts = [float(x) for x in parts[1:]]
for ci, condition in enumerate(conditions):
counts[condition][region_name] = region_counts[ci]
return dict(counts)
The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:
- groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
- sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.
import numpy
def get_conditions_and_genes(work_counts):
conditions = work_counts.keys()
conditions.sort()
all_genes = []
for c in conditions:
all_genes.extend(work_counts[c].keys())
all_genes = list(set(all_genes))
all_genes.sort()
sizes = [work_counts[c]["Total"] for c in conditions]
all_genes.remove("Total")
return conditions, all_genes, sizes
def edger_matrices(work_counts):
conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
assert len(sizes) == 2
groups = [1, 2]
data = []
final_genes = []
for g in all_genes:
cur_row = [int(work_counts[c][g]) for c in conditions]
if sum(cur_row) > 0:
data.append(cur_row)
final_genes.append(g)
return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
conditions, final_genes)
The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList object, this object is passed to deDGE which does the actual digital gene expression analysis, and finally topTags is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
def run_edger(data, groups, sizes, genes):
robjects.r('''
library(edgeR)
''')
params = {'group' : groups, 'lib.size' : sizes}
dgelist = robjects.r.DGEList(data, **params)
ms = robjects.r.deDGE(dgelist, doPoisson=True)
tags = robjects.r.topTags(ms, pair=groups, n=len(genes))
indexes = [int(t) - 1 for t in tags.rownames()]
pvals = list(tags.r['adj.P.Val'][0])
assert len(indexes) == len(pvals)
pvals_w_index = zip(indexes, pvals)
pvals_w_index.sort()
assert len(pvals_w_index) == len(indexes)
return [p for i,p in pvals_w_index]
The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:
def write_outfile(outfile, genes, conditions, work_counts, probs):
with open(outfile, "w") as out_handle:
writer = csv.writer(out_handle)
writer.writerow(["Region"] +
["%s count" % c for c in conditions] + ["edgeR p-value"])
out_info = []
for i, gene in enumerate(genes):
counts = [int(work_counts[c][gene]) for c in conditions]
out_info.append((probs[i], [gene] + counts))
out_info.sort()
[writer.writerow(start + [prob]) for prob, start in out_info]
Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.
Sorting genomic alignments using Python
Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.
The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.
We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.
The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.
from bx.align import maf
from bx import interval_index_file
def build_index(in_file, index_file):
indexes = interval_index_file.Indexes()
with open(in_file) as in_handle:
reader = maf.Reader(in_handle)
while 1:
pos = reader.file.tell()
rec = reader.next()
if rec is None:
break
for c in rec.components:
indexes.add(c.src, c.forward_strand_start,
c.forward_strand_end, pos, max=c.src_size )
with open(index_file, "w") as index_handle:
indexes.write(index_handle)
Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.
rec_info = []
with open(in_file) as in_handle:
reader = maf.Reader(in_handle)
while 1:
pos = reader.file.tell()
rec = reader.next()
if rec is None:
break
rec_info.append((rec.text_size, pos))
rec_info.sort(reverse=True)
Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.
index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
writer = maf.Writer(out_handle)
for size, pos in rec_info:
rec = index.get_at_offset(pos)
writer.write(rec)
The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.
Thoughts from BOSC 2009; Python in bioinformatics
I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This lead to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.
Python in Bioinformatics
One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.
This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.
What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each others platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curves to producing useful biology related code in Python.
Galaxy
My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first idea was taken very successfully by the Firefox browser with their plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.
A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.
The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons like backend, SQLAlchemy for databases interactivity, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.
We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.
I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.
Talking at BOSC 2009 about publishing biological data on the web
The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years, and giving a short and sweet 10 minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with The Data and Analysis Management special interest group, also full of interesting talks.
The talk will promote development of reusable web based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:
- Motivation: rapidly organize and display biological data in a web accessible format.
- Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
- Current state: back end databases for storing biological data — BioSQL, GMOD
- Current state: full featured web applications targeted at users — Galaxy, GBrowse
- My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with larger community.
- Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
- Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
- Framework: make use of front end javascript frameworks — jQuery, ExtJS.
- Framework: make use of back end web frameworks — Pylons
- Implementation: Demo server for displaying sequences plus annotations
- Implementation: Utilizes BioSQL schema, ported to object oriented data store; Google App engine backend or MongoDB backend
- Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
- Implementation: Additional screenshots from internal web displays.
- Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
- Challenges: Re-usable components for cross-language functionality; javascript front end displays for multi-language back ends.
- Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.
I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.
Python libraries for synthetic biology
Researchers today are empowered with the ability to devise DNA sequences and order these designs for fabrication. Starting with known pieces from biological systems, variations can be conceptualized and explored with increasing ease. As a result, the field of synthetic biology has expanded along with techniques used to provide larger pieces of synthesized DNA. Companies such as DNA 2.0 and GENEART provide on-demand synthesis services. Additionally, many researchers design, optimize and generate custom DNA sequences in their own lab.
My previous job was at a synthetic biology start up working on the software side of their fabrication and design services. The company was unfortunately a recent victim of the economy. The learning from five years of in the trenches synthesis work will be very useful to others tackling similar projects. With that in mind, the Python libraries used for construction, sequencing verification, and many other synthetic biology related tasks are being made available on Bitbucket.
Some highlights of the library:
- Oligo synthesis — assemble sequences from oligo sized pieces, ordered from a vendor such as IDT; this includes PCR and ligation based assembly methods.
- Cloning synthesis — combine synthesized pieces into larger assemblies.
- Codon optimization — optimize coding sequences using a codon sampling strategy.
- Sequencing analysis — verify synthesized constructs match the expected sequence.
- Calculation of melting temperatures
- Blackwatch — check sequences for known pathogens
The code in this library was integrated as part of an automated synthesis platform. As a result, it does not feature ready to run scripts. Rather, the code can serve as a source for developing these type of tools. I would love to see the most useful parts refactored and integrated into existing software communities like Biopython. Please drop a note if you have any interest in collaborating on such a project.
Evaluating key-value and document stores for short read data
Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata.
A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. I had found myself embedding too much logic about database location and structure into my code while developing with Berkeley DB.
The main factors considered in evaluating the key/value and document stores for my needs were:
- Network accessible
- Python library support
- Data loading time
- Data query time
- File storage space
- Implementation of queries beyond key/value retrieval
Another relevant consideration, which is not as important to my work but might be to yours, is replicating and distributing a database across multiple servers. These are major issues when developing websites with millions of concurrent users; in science applications I am less likely to find that kind of popularity for short read SNP analysis.
Leonard had recently posted a summary of his experience evaluating distributed key stores, which served as an excellent starting point for looking at the different options out there. I decided to do an in-depth evaluation of three stores:
- Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library.
- CouchDB, using the couchdb-python library.
- MongoDB, using pymongo.
Tokyo Cabinet is a key/value store which received a number of excellent reviews for its speed and reliability. CouchDB and MongoDB are document oriented stores which offer additional query capabilities.
To evaluate the performance of these three stores, frequency counts for 2.8 million unique reads were loaded. Real stores would have additional details on each read, but the general idea is the same: a large number of relatively small documents. Each of the stores was accessed across the network on a remote machine. The 2.8 million reads were loaded, and then a half million records were retrieved from the database. The python scripts are available on github. The below table summarizes the results:
| Database | Load time | Retrieval time | File size |
|---|---|---|---|
| Tokyo Cabinet/Tyrant | 12 minutes | 3 1/2 minutes | 24MB |
| CouchDB | 5 1/2 minutes | 14 1/2 minutes | 236MB |
| MongoDB | 3 minutes | 4 minutes | 192-960MB |
For CouchDB, I initially reported large numbers which were improved dramatically with some small tweaks. With a naive loading strategy, times were in the range of 22 hours with large 6G files. Thanks to tips from Chris and Paul in the comments, the loading script was modified to use bulk loading. With this change, loading times and file sizes are in the range of the other stores and the new times are reflected in the table. There appear to also be some tweaks that can be made to favor speed over reliability; these tests were done with the standard configuration. The message here is to dig deeper if you find performance issues with CouchDB; small differences in usage can provide huge gains.
Based on the loading tests, I decided to investigate MongoDB and Tokyo Cabinet/Tyrant further. CouchDB loading speeds were improved dramatically by the changes mentioned above; however, retrieval speeds are about 3 times slower. Fetching single records from the database is the most important speed consideration for my work since it happens more frequently than loading, and reflects in the responsiveness of the web front end accessing the database. It is worth investigating whether client code changes can also speed up CouchDB. For Tokyo Cabinet/Tyrant and MongoDB the main performance trade-off was disk space usage for the database files. Tokyo Cabinet loads about 4 times slower, but maintains a much more compact file representation. To better understand how MongoDB stores file, I wrote to the community mailing list and received quick and thoughtful responses on the issue. See the full thread if you are interested in the details; in summary, MongoDB pre-allocated space to improve loading time and this allocation becomes less of an issue as the database size increases.
Looking beyond performance issues, Tokyo Cabinet/Tyrant and MongoDB represent two ends of the storage spectrum. MongoDB is a larger, full featured database providing complex query operations and management of multiple data stores. Tokyo Cabinet and Tyrant provide a lightweight solution for key/value retrieval. Each separate remote Tokyo Cabinet data store requires a Tyrant instance to be running. My work involves generating many different key/value databases for individual short read sequencing projects. To reasonably achieve this with remote Tyrant stores, I would need to develop a server that could start and manage Tyrant instances on demand. Additionally, if my query needs change the key/value paradigm of Tokyo Cabinet would require generating additional key values stores. For instance, we could not readily retrieve reads with a frequency greater than a defined threshold.
In conclusion, both Tokyo Cabinet/Tyrant and MongoDB proved to be excellent solutions for managing the volume and style of data associated with short read experiments. MongoDB provided additional functionality in the form of remote data store management and advanced queries which will be useful for my work; I’ll be switching my Berkeley DB stores over to MongoDB and continuing to explore its capabilities. I would welcome hearing about the solutions others have employed for similar storage and query issues.
