Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘R

Automated retrieval of expression data with Python and R

with 10 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = ""
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """ = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results =
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary =
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    robjects.r.assign("", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
      gsm <- getGEO(
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    robjects.r.assign("", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
      gpl <- getGEO( <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab') # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Gene Ontology analysis with Python and Bioconductor

with 9 comments

Last post we discussed extracting differentially expressed genes from a set of short read data. The output of this analysis is a large list of genes, and the next challenge is to try and extract interesting patterns from the results. Are genes in particular families over-represented? Are there a large number of genes with a particular function? These patterns can help direct your future experiments and provide an understandable grouping for biologists examining the data.

The Gene Ontology (GO) project provides a standardized set of terms describing the molecular function of genes. We will use the topGO package from the Bioconductor project to identify over-represented GO terms from a set of differentially expressed genes. Python will be used to prepare the data, utilizing rpy2 to call R for the statistical analysis.

The first step is to parse input files describing the differentially expressed genes and the mapping of gene names to GO terms. For the example we will use a simple CSV file from our previous analysis and an equally simple file describing the gene to GO mapping. For real life work, you will need to get the GO mapping file for your organism; for instance, here is the Arabidopsis GO mapping from TAIR.

The input file of differentially expressed genes is parsed into a dictionary where the keys are gene names and the values are associated p-values:

import csv

def parse_input_csv(in_handle):
    reader = csv.reader(in_handle) # header
    all_genes = dict()
    for (gene_name, _, _, pval) in reader:
        all_genes[gene_name] = float(pval)
    return all_genes

The GO mapping input file is parsed into two dictionaries: one mapping genes to GO terms for the GO analysis, and the second mapping GO terms to genes for displaying the results of the analysis:

import collections

def parse_go_map_file(in_handle, genes_w_pvals):
    gene_list = genes_w_pvals.keys()
    gene_to_go = collections.defaultdict(list)
    go_to_gene = collections.defaultdict(list)
    for line in in_handle:
        parts = line.split("\t")
        gene_id = parts[0]
        go_id = parts[1].strip()
        if gene_id in gene_list:
    return dict(gene_to_go), dict(go_to_gene)

Next we set up our associated R environment for GO analysis. This involves loading the topGO library and creating a function to classify genes as either differentially expressed, or not. We do this based on the p-value, classifying genes with a p-value of 0.01 or less as differentially expressed:

import rpy2.robjects as robjects

gene_pval = 1e-2
    topDiffGenes = function(allScore) {
      return (allScore < %s)
''' % gene_pval)

Now it’s time to run the analysis. The paramters are defined as a dictionary which initializes a topGOdata object:

  • ontology — the type of GO ontology to analyze; molecular function (MF) is examined here.
  • annot — how the GO terms are annotated; we specify they will be identified by a mapping of genes to GO ids, passed as the gene2GO argument.
  • geneSelectionFun — the function we defined above to determine if genes are differentially expressed.
  • allGenes — an R named vector where the names are genes and the values are p-values; we use a utility function to convert a python dictionary to the named vector.
  • gene2GO — an R named vector mapping gene names to associated GO terms.

The initialized object is run using Fisher classic analysis. Several different analysis methods and test statistics are available in the package. The resulting scores are extracted and a summary table of the results is generated:

def _dict_to_namedvector(init_dict):
    """Call R to create a named vector from an input dictionary.
    return robjects.r.c(**init_dict)

go_term_type = "MF"
topgo_method = "classic" # choice of classic, elim, weight
params = {"ontology" : go_term_type,
          "annot" : robjects.r["annFUN.gene2GO"],
          "geneSelectionFun" : robjects.r["topDiffGenes"],
          "allGenes" : _dict_to_namedvector(gene_vals),
          "gene2GO" : _dict_to_namedvector(gene_to_go)
go_data ="topGOdata", **params)
results = robjects.r.runTest(go_data, algorithm=topgo_method,
scores = robjects.r.score(results)
score_names = scores.getnames()
num_summarize = min(100, len(score_names))
results_table = robjects.r.GenTable(go_data, elimFisher=results,
        orderBy="elimFisher", topNodes=num_summarize)

The scores and results table are used to generate a list of over-represented GO terms with their associated p-values and descriptions:

# extract term names from the topGO summary dataframe
ids_to_terms = dict()
for index, go_id in enumerate(results_table[GO_ID_INDEX]):
    ids_to_terms[go_id] = results_table[TERM_INDEX][index]
go_terms = []
# convert the scores and results information info terms to return
for index, item in enumerate(scores):
    if item < go_pval:
        go_id = score_names[index]
        go_terms.append((item, go_id, ids_to_terms.get(go_id, "")))

Finally, we print out the resulting overexpressed GO terms along with associated genes:

def print_go_info(go_terms, go_term_type, go_to_gene):
    for final_pval, go_id, go_term in go_terms:
        genes = []
        for check_go in [go_id] + get_go_children(go_id, go_term_type):
            genes.extend(go_to_gene.get(check_go, []))
        genes = sorted(list(set(genes)))
        print "-> %s (%s) : %0.4f" % (go_id, go_term, final_pval)
        for g in genes:
            print g

One tricky part of the post-analysis processing is finding all of the genes associated with an identified GO id. The GO terminology is hierarchical, so a resulting GO identifier like GO:0042578 in the example results file can represent both that term and more specific terms. A nice way to illustrate this for yourself is to look at the AmiGO browser display for the term.

The result is that genes associated with an over-represented term will come from both the identified term and more specific terms. Bioconductor has an interface to the GO database, so we can extract all of the relevant terms by moving down the GO tree collecting the more specific children, given the parent term:

def get_go_children(go_term, go_term_type):
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        to_check = new_children
    children = list(set(children))
    return children

The full script pulls all of this code together into a working example. You will need to update the parsers to match your input and GO mapping file, but otherwise the code can plug directly into full genome analyses of differential expression data.

Written by Brad Chapman

October 18, 2009 at 11:03 am

Differential expression analysis with Bioconductor and Python

with 10 comments

This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.

Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.

Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.

Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:

import csv
import collections

def read_count_file(in_file):
    """Read count information from a simple CSV file into a dictionary.
    counts = collections.defaultdict(dict)
    with open(in_file) as in_handle:
        reader = csv.reader(in_handle)
        header =
        conditions = header[1:]
        for parts in reader:
            region_name = parts[0]
            region_counts = [float(x) for x in parts[1:]]
            for ci, condition in enumerate(conditions):
                counts[condition][region_name] = region_counts[ci]
    return dict(counts)

The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:

  • groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
  • sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.

import numpy

def get_conditions_and_genes(work_counts): 
    conditions = work_counts.keys()
    all_genes = []
    for c in conditions:
    all_genes = list(set(all_genes))
    sizes = [work_counts[c]["Total"] for c in conditions]
    return conditions, all_genes, sizes
def edger_matrices(work_counts):
    conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
    assert len(sizes) == 2
    groups = [1, 2]
    data = []
    final_genes = []
    for g in all_genes:
        cur_row = [int(work_counts[c][g]) for c in conditions]
        if sum(cur_row) > 0:
    return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
            conditions, final_genes)

The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList object, this object is passed to deDGE which does the actual digital gene expression analysis, and finally topTags is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.

import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

def run_edger(data, groups, sizes, genes):
    params = {'group' : groups, 'lib.size' : sizes}
    dgelist = robjects.r.DGEList(data, **params)
    ms = robjects.r.deDGE(dgelist, doPoisson=True)
    tags = robjects.r.topTags(ms, pair=groups, n=len(genes))
    indexes = [int(t) - 1 for t in tags.rownames()]
    pvals = list(tags.r['adj.P.Val'][0])
    assert len(indexes) == len(pvals)
    pvals_w_index = zip(indexes, pvals)
    assert len(pvals_w_index) == len(indexes)
    return [p for i,p in pvals_w_index]

The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:

def write_outfile(outfile, genes, conditions, work_counts, probs):
    with open(outfile, "w") as out_handle:
        writer = csv.writer(out_handle)
        writer.writerow(["Region"] +
                ["%s count" % c for c in conditions] + ["edgeR p-value"])
        out_info = []
        for i, gene in enumerate(genes):
            counts = [int(work_counts[c][gene]) for c in conditions]
            out_info.append((probs[i], [gene] + counts))
        [writer.writerow(start + [prob]) for prob, start in out_info]

Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.

Written by Brad Chapman

September 13, 2009 at 8:28 pm