Differential expression analysis with Bioconductor and Python
This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.
Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.
Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.
Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:
import csv import collections def read_count_file(in_file): """Read count information from a simple CSV file into a dictionary. """ counts = collections.defaultdict(dict) with open(in_file) as in_handle: reader = csv.reader(in_handle) header = reader.next() conditions = header[1:] for parts in reader: region_name = parts[0] region_counts = [float(x) for x in parts[1:]] for ci, condition in enumerate(conditions): counts[condition][region_name] = region_counts[ci] return dict(counts)
The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:
- groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
- sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.
import numpy def get_conditions_and_genes(work_counts): conditions = work_counts.keys() conditions.sort() all_genes = [] for c in conditions: all_genes.extend(work_counts[c].keys()) all_genes = list(set(all_genes)) all_genes.sort() sizes = [work_counts[c]["Total"] for c in conditions] all_genes.remove("Total") return conditions, all_genes, sizes def edger_matrices(work_counts): conditions, all_genes, sizes = get_conditions_and_genes(work_counts) assert len(sizes) == 2 groups = [1, 2] data = [] final_genes = [] for g in all_genes: cur_row = [int(work_counts[c][g]) for c in conditions] if sum(cur_row) > 0: data.append(cur_row) final_genes.append(g) return (numpy.array(data), numpy.array(groups), numpy.array(sizes), conditions, final_genes)
The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList
object, this object is passed to deDGE
which does the actual digital gene expression analysis, and finally topTags
is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.
import rpy2.robjects as robjects import rpy2.robjects.numpy2ri def run_edger(data, groups, sizes, genes): robjects.r(''' library(edgeR) ''') params = {'group' : groups, 'lib.size' : sizes} dgelist = robjects.r.DGEList(data, **params) ms = robjects.r.deDGE(dgelist, doPoisson=True) tags = robjects.r.topTags(ms, pair=groups, n=len(genes)) indexes = [int(t) - 1 for t in tags.rownames()] pvals = list(tags.r['adj.P.Val'][0]) assert len(indexes) == len(pvals) pvals_w_index = zip(indexes, pvals) pvals_w_index.sort() assert len(pvals_w_index) == len(indexes) return [p for i,p in pvals_w_index]
The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:
def write_outfile(outfile, genes, conditions, work_counts, probs): with open(outfile, "w") as out_handle: writer = csv.writer(out_handle) writer.writerow(["Region"] + ["%s count" % c for c in conditions] + ["edgeR p-value"]) out_info = [] for i, gene in enumerate(genes): counts = [int(work_counts[c][gene]) for c in conditions] out_info.append((probs[i], [gene] + counts)) out_info.sort() [writer.writerow(start + [prob]) for prob, start in out_info]
Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.
Nice showcase !
In run_edger(), one option could be to return the adjusted p-values instead of raw p-values.
That could then be (relying on the column name rather than on its position):
With rpy2-2.1-dev, that’s tentatively simpler:
L.
Laurent Gautier
September 14, 2009 at 11:21 am
Laurent;
Great suggestion. Retrieving by column name makes the code much clearer. I updated the post and code to incorporate this. Thank you for the helpful feedback,
Brad
Brad Chapman
September 15, 2009 at 7:18 am
Great post! Thanks so much! Your practical approach is really useful!
Paolo
September 16, 2009 at 9:15 am
Hi,
script is not working, “topTags” function should not even have “pair” argument. Old version or I am missing something?
Dmitri
Dmitri
October 28, 2009 at 7:28 am
Dmitri;
Thanks for the heads up. It looks like the 1.3 series of edgeR packages changes the interface. An updated version of the script that handles edgeR 1.3+ (R 2.10) and edgeR 1.2 (R 2.9) is in Git:
http://github.com/chapmanb/bcbb/blob/master/stats/count_diffexp.py
Brad
Brad Chapman
October 28, 2009 at 10:19 am
Thanks a lot! Works surprisingly fast – eg ~30,000 genes done under 1 minute!
Dmitri
October 29, 2009 at 7:51 am
Hello,
Thanks for a great post.
Is there a similar code to analyze array CGH data?
Thank you.
SsnChow
February 22, 2010 at 12:02 am
Glad it was helpful. There are several packages in Bioconductor to help with CGH and copy number variation analyses, although I don’t have any direct experience. This recent message from the mailing list might be a good place to start:
http://article.gmane.org/gmane.science.biology.informatics.conductor/26957
Brad
Brad Chapman
February 22, 2010 at 9:17 am
[…] Differential expression analysis with Bioconductor and Python […]
Draft Sequence of Potato Genome Released | MGRC
December 12, 2012 at 4:04 pm
Code is not running.. alot of errors, please help me
Anonymous
May 27, 2020 at 3:10 pm