Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Archive for the ‘analysis’ Category

Differential expression analysis with Bioconductor and Python

with 9 comments

This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.

Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.

Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.

Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:

import csv
import collections

def read_count_file(in_file):
    """Read count information from a simple CSV file into a dictionary.
    counts = collections.defaultdict(dict)
    with open(in_file) as in_handle:
        reader = csv.reader(in_handle)
        header =
        conditions = header[1:]
        for parts in reader:
            region_name = parts[0]
            region_counts = [float(x) for x in parts[1:]]
            for ci, condition in enumerate(conditions):
                counts[condition][region_name] = region_counts[ci]
    return dict(counts)

The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:

  • groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
  • sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.

import numpy

def get_conditions_and_genes(work_counts): 
    conditions = work_counts.keys()
    all_genes = []
    for c in conditions:
    all_genes = list(set(all_genes))
    sizes = [work_counts[c]["Total"] for c in conditions]
    return conditions, all_genes, sizes
def edger_matrices(work_counts):
    conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
    assert len(sizes) == 2
    groups = [1, 2]
    data = []
    final_genes = []
    for g in all_genes:
        cur_row = [int(work_counts[c][g]) for c in conditions]
        if sum(cur_row) > 0:
    return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
            conditions, final_genes)

The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList object, this object is passed to deDGE which does the actual digital gene expression analysis, and finally topTags is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.

import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

def run_edger(data, groups, sizes, genes):
    params = {'group' : groups, 'lib.size' : sizes}
    dgelist = robjects.r.DGEList(data, **params)
    ms = robjects.r.deDGE(dgelist, doPoisson=True)
    tags = robjects.r.topTags(ms, pair=groups, n=len(genes))
    indexes = [int(t) - 1 for t in tags.rownames()]
    pvals = list(tags.r['adj.P.Val'][0])
    assert len(indexes) == len(pvals)
    pvals_w_index = zip(indexes, pvals)
    assert len(pvals_w_index) == len(indexes)
    return [p for i,p in pvals_w_index]

The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:

def write_outfile(outfile, genes, conditions, work_counts, probs):
    with open(outfile, "w") as out_handle:
        writer = csv.writer(out_handle)
        writer.writerow(["Region"] +
                ["%s count" % c for c in conditions] + ["edgeR p-value"])
        out_info = []
        for i, gene in enumerate(genes):
            counts = [int(work_counts[c][gene]) for c in conditions]
            out_info.append((probs[i], [gene] + counts))
        [writer.writerow(start + [prob]) for prob, start in out_info]

Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.

Written by Brad Chapman

September 13, 2009 at 8:28 pm

Organization of literature using PubMed related articles

leave a comment »

When dealing with a long list of journal articles, what is the best method to organize them? I was confronted with this problem in designing an interface where users would pick specific papers and retrieve results tied to them. Presenting them as the raw list was unsatisfying; it is fine for users who know exactly what articles they want, but naive users would have a lot of difficulty finding relevant articles. Even for power users, a better classification system could help reveal items they may not have known about.

The approach I took was to group together papers based on similarity. The NCBI PubMed literature database has links to related articles, which it exposes programmatically though EUtils. Using the Biopython Entrez interface, the first step is to retrieve a dictionary of related IDs for each starting article, ordered by relevance:

def _get_elink_related_ids(self, pmids):
    pmid_related = {}
    for pmid in pmids:
        handle = Entrez.elink(dbform='pubmed', db='pubmed', id=pmid)
        record =
        cur_ids = []
        for link_dict in record[0]['LinkSetDb'][0]['Link']:
            cur_ids.append((int(link_dict.get('Score', 0)),
        local_ids = [x[1] for x in cur_ids if x[1] in pmids]
        if pmid in local_ids:
        pmid_related[pmid] = local_ids
    return pmid_related

Trying to group directly based on this dictionary will often result in one large group, since many of the articles may be linked together through a few common articles. For instance, a review may be related to several other papers in non-overlapping areas. To make the results as useful as possible we define a maximum and minimum group size, and a two parameters to filter the related lists:

  • overrep_thresh: The percentage of papers an item is related to out of all papers being grouped; the threshold sets a maximum number of papers that can be related. For instance, a value of .25 means that a journal will be related to 25% or less of the total papers.
  • related_max: The number of related papers to use. The best related articles go into the grouping.

These parameters define a filter for our dictionary of related articles:

def _filter_related(self, inital_dict, overrep_thresh, related_max):
    final_dict = {}
    all_vals = reduce(operator.add, inital_dict.values())
    for item_id, item_vals in inital_dict.items():
        final_vals = [val for val in item_vals if 
            float(all_vals.count(val)) / len(inital_dict) <= overrep_thresh&#93;
        final_dict&#91;item_id&#93; = final_vals&#91;:related_max&#93;
    return final_dict

The filtered list is grouped using a generalized version of the <code>examine_paralogs</code> function used in an earlier post to group together <a href="">location and duplication information</a>. Sets combine any groups with overlapping articles:

def _groups_from_related_dict(self, related_dict):
    cur_groups = []
    all_base = related_dict.keys()
    for base_id, cur_ids in related_dict.items():
        overlap = set(cur_ids) & set(all_base)
        if len(overlap) > 0:
            new_group = set(overlap | set([base_id]))
            is_unique = True
            for exist_i, exist_group in enumerate(cur_groups):
                if len(new_group & exist_group) > 0:
                    update_group = new_group | exist_group
                    cur_groups[exist_i] = update_group
                    is_unique = False
            if is_unique:
    return [list(g) for g in cur_groups]

With this list, we want to extract the groups and their articles that fit in our grouping criteria, the minimum and maximum size:

def _collect_new_groups(self, pmid_related, groups):
final_groups = []
for group_items in groups:
final_items = [i for i in group_items if pmid_related.has_key(i)]
if (len(final_items) >= self._min_group and
len(final_items) <= self._max_group): final_groups.append(final_items) for item in final_items: del pmid_related[item] final_related_dict = {} for pmid, related in pmid_related.items(): final_related = [r for r in related if pmid_related.has_key(r)] final_related_dict[pmid] = final_related return final_groups, final_related_dict [/sourcecode]

Utilizing these functions, the main algorithm steps through a series of increasingly less stringent parameters picking out groups which fall into our thresholds. Closely related journal articles are grouped first; more general papers with less association will be placed in groups in later rounds:

def get_pmid_groups(self, pmids):
pmid_related = self._get_elink_related_ids(pmids)
filter_params = self._filter_params[:]
final_groups = []
while len(pmid_related) > 0:
if len(filter_params) == 0:
raise ValueError(“Ran out of parameters before finding groups”)
cur_thresh, cur_related = filter_params.pop(0)
while 1:
filt_related = self._filter_related(pmid_related, cur_thresh,
groups = self._groups_from_related_dict(filt_related)
new_groups, pmid_related = self._collect_new_groups(
pmid_related, groups)
if len(new_groups) == 0:
if len(pmid_related) < self._max_group: final_groups.append(pmid_related.keys()) pmid_related = {} return final_groups [/sourcecode]

The full code wrapped up into a class is available from the GitHub repository.

This is one approach to automatically grouping a large list of literature to make interesting items more discoverable. With the work being done on full text indexing, the data underlying resources such as iHOP might be used to do these groupings even more effectively using similar algorithms.

Written by Brad Chapman

February 13, 2009 at 6:19 am