Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Gene Ontology analysis with Python and Bioconductor

with 9 comments

Last post we discussed extracting differentially expressed genes from a set of short read data. The output of this analysis is a large list of genes, and the next challenge is to try and extract interesting patterns from the results. Are genes in particular families over-represented? Are there a large number of genes with a particular function? These patterns can help direct your future experiments and provide an understandable grouping for biologists examining the data.

The Gene Ontology (GO) project provides a standardized set of terms describing the molecular function of genes. We will use the topGO package from the Bioconductor project to identify over-represented GO terms from a set of differentially expressed genes. Python will be used to prepare the data, utilizing rpy2 to call R for the statistical analysis.

The first step is to parse input files describing the differentially expressed genes and the mapping of gene names to GO terms. For the example we will use a simple CSV file from our previous analysis and an equally simple file describing the gene to GO mapping. For real life work, you will need to get the GO mapping file for your organism; for instance, here is the Arabidopsis GO mapping from TAIR.

The input file of differentially expressed genes is parsed into a dictionary where the keys are gene names and the values are associated p-values:

import csv

def parse_input_csv(in_handle):
    reader = csv.reader(in_handle)
    reader.next() # header
    all_genes = dict()
    for (gene_name, _, _, pval) in reader:
        all_genes[gene_name] = float(pval)
    return all_genes

The GO mapping input file is parsed into two dictionaries: one mapping genes to GO terms for the GO analysis, and the second mapping GO terms to genes for displaying the results of the analysis:

import collections

def parse_go_map_file(in_handle, genes_w_pvals):
    gene_list = genes_w_pvals.keys()
    gene_to_go = collections.defaultdict(list)
    go_to_gene = collections.defaultdict(list)
    for line in in_handle:
        parts = line.split("\t")
        gene_id = parts[0]
        go_id = parts[1].strip()
        if gene_id in gene_list:
            gene_to_go[gene_id].append(go_id)
            go_to_gene[go_id].append(gene_id)
    return dict(gene_to_go), dict(go_to_gene)

Next we set up our associated R environment for GO analysis. This involves loading the topGO library and creating a function to classify genes as either differentially expressed, or not. We do this based on the p-value, classifying genes with a p-value of 0.01 or less as differentially expressed:

import rpy2.robjects as robjects

gene_pval = 1e-2
robjects.r('''
    library(topGO)
''')
robjects.r('''
    topDiffGenes = function(allScore) {
      return (allScore < %s)
    }
''' % gene_pval)

Now it’s time to run the analysis. The paramters are defined as a dictionary which initializes a topGOdata object:

  • ontology — the type of GO ontology to analyze; molecular function (MF) is examined here.
  • annot — how the GO terms are annotated; we specify they will be identified by a mapping of genes to GO ids, passed as the gene2GO argument.
  • geneSelectionFun — the function we defined above to determine if genes are differentially expressed.
  • allGenes — an R named vector where the names are genes and the values are p-values; we use a utility function to convert a python dictionary to the named vector.
  • gene2GO — an R named vector mapping gene names to associated GO terms.

The initialized object is run using Fisher classic analysis. Several different analysis methods and test statistics are available in the package. The resulting scores are extracted and a summary table of the results is generated:

def _dict_to_namedvector(init_dict):
    """Call R to create a named vector from an input dictionary.
    """
    return robjects.r.c(**init_dict)

go_term_type = "MF"
topgo_method = "classic" # choice of classic, elim, weight
params = {"ontology" : go_term_type,
          "annot" : robjects.r["annFUN.gene2GO"],
          "geneSelectionFun" : robjects.r["topDiffGenes"],
          "allGenes" : _dict_to_namedvector(gene_vals),
          "gene2GO" : _dict_to_namedvector(gene_to_go)
          }
go_data = robjects.r.new("topGOdata", **params)
results = robjects.r.runTest(go_data, algorithm=topgo_method,
        statistic="fisher")
scores = robjects.r.score(results)
score_names = scores.getnames()
num_summarize = min(100, len(score_names))
results_table = robjects.r.GenTable(go_data, elimFisher=results,
        orderBy="elimFisher", topNodes=num_summarize)

The scores and results table are used to generate a list of over-represented GO terms with their associated p-values and descriptions:

# extract term names from the topGO summary dataframe
GO_ID_INDEX = 0
TERM_INDEX = 1
ids_to_terms = dict()
for index, go_id in enumerate(results_table[GO_ID_INDEX]):
    ids_to_terms[go_id] = results_table[TERM_INDEX][index]
go_terms = []
# convert the scores and results information info terms to return
for index, item in enumerate(scores):
    if item < go_pval:
        go_id = score_names[index]
        go_terms.append((item, go_id, ids_to_terms.get(go_id, "")))
go_terms.sort()

Finally, we print out the resulting overexpressed GO terms along with associated genes:

def print_go_info(go_terms, go_term_type, go_to_gene):
    for final_pval, go_id, go_term in go_terms:
        genes = []
        for check_go in [go_id] + get_go_children(go_id, go_term_type):
            genes.extend(go_to_gene.get(check_go, []))
        genes = sorted(list(set(genes)))
        print "-> %s (%s) : %0.4f" % (go_id, go_term, final_pval)
        for g in genes:
            print g

One tricky part of the post-analysis processing is finding all of the genes associated with an identified GO id. The GO terminology is hierarchical, so a resulting GO identifier like GO:0042578 in the example results file can represent both that term and more specific terms. A nice way to illustrate this for yourself is to look at the AmiGO browser display for the term.

The result is that genes associated with an over-represented term will come from both the identified term and more specific terms. Bioconductor has an interface to the GO database, so we can extract all of the relevant terms by moving down the GO tree collecting the more specific children, given the parent term:

def get_go_children(go_term, go_term_type):
    robjects.r('''
        library(GO.db)
    ''')
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        children.extend(new_children)
        to_check = new_children
    children = list(set(children))
    return children

The full script pulls all of this code together into a working example. You will need to update the parsers to match your input and GO mapping file, but otherwise the code can plug directly into full genome analyses of differential expression data.

Written by Brad Chapman

October 18, 2009 at 11:03 am

9 Responses

Subscribe to comments with RSS.

  1. Dear Brad Chapman,
    Python and R is fantastic combination: the pleasantness of Python together with the amazing power of R and it’s packages. I’ve been using this combination for bioinformatics too, but very probably with fewer familiarity than you. Sometimes I ask myself how robust would be implementing some task with Python/Rpy instead of directly in R. I am sure your posts make Python’s reputation stronger. Thanks for sharing your codes.

    Frederico Arnoldi

    May 11, 2010 at 3:27 pm

  2. Dear Brad,

    Thank you for your great post.

    I have try your script and examples but failed(Please see it as following). Could you tell me how to fix it? Thank you in advance.

    $python diffexp_go_analysis.py diffexp_example-diffs.csv ATH_GO_GOSLIM.txt

    Loading required package: graph
    Loading required package: Biobase

    Welcome to Bioconductor

    Vignettes contain introductory material. To view, type
    ‘openVignette()’. To cite Bioconductor, see
    ‘citation(“Biobase”)’ and for packages ‘citation(pkgname)’.

    Loading required package: GO.db
    Loading required package: AnnotationDbi
    Loading required package: DBI
    Loading required package: SparseM
    Package SparseM (0.85) loaded.
    To cite, see citation(“SparseM”)

    Attaching package: ‘SparseM’

    The following object(s) are masked from ‘package:base’:

    backsolve

    groupGOTerms: GOBPTerm, GOMFTerm, GOCCTerm environments built.

    Building most specific GOs …..Error in split.default(geneID[goodGO], allGO[goodGO]) :
    first argument must be a vector
    In addition: Warning message:
    In is.na(gene2GO) : is.na() applied to non-(list or vector) of type ‘NULL’
    Traceback (most recent call last):
    File “diffexp_go_analysis.py”, line 130, in
    main(sys.argv[1], sys.argv[2])
    File “diffexp_go_analysis.py”, line 25, in main
    gene_pval, go_pval, topgo_method)
    File “diffexp_go_analysis.py”, line 81, in run_topGO
    go_data = robjects.r.new(“topGOdata”, **params)
    File “/usr/lib64/python2.6/site-packages/rpy2/robjects/__init__.py”, line 423, in __call__
    res = super(RFunction, self).__call__(*new_args, **new_kwargs)
    rinterface.RRuntimeError: Error in split.default(geneID[goodGO], allGO[goodGO]) :
    first argument must be a vector

    Zhidong

    July 15, 2010 at 4:20 am

    • Zhidong;
      You’re getting this message because none of the terms in your GO mapping file (ATH_GO_GOSLIM.txt) match the example file (diffexp_example-diffs.csv). You’ll need to supply a real list of terms for the first argument and be sure the names match what is in your GO file. You’ll also need to modify the parse_go_map_file function so you can pull out the Arabidopsis gene names (1st column) and GO terms (6th column).

      I updated the code so the error message in this case is more useful and to make it work on the latest rpy2 version. Hope this helps get things working for you.

      Brad Chapman

      July 15, 2010 at 10:20 am

  3. Hi Brad,

    That is a naive problem for I know little about python and now I have run it successfully following your reply. Thank you very much.

    Zhidong

    July 16, 2010 at 9:34 am

  4. Hi Brad,

    Nice example of GO analysis. However, you can save yourself a bit of work at the end. In each organism specific database in Bioconductor, there is a wrapper to return the mapping of all genes (by Entrez ID) to that particular GO term, thereby saving you the work of traversing the tree.

    For example, with the “org.Rn.eg.db” database, you can use: org.Rn.egGO2ALLEGS to return all genes annotated to your terms of interest.

    Robert

    July 29, 2010 at 1:01 pm

  5. Dear Brad,
    If I have the predicted target genes of miRNAs with no P values of difference expression, how could I do the functional enrichment analysis of these target genes?

    Zhidong

    August 3, 2010 at 7:55 pm

    • Zhidong;
      The purpose of the p-value is to select the enriched genes from the background. If you have some other method of identifying the enriched genes, assign them proxy p-values that differentiate them from background. The topDiffGenes function should then be modified to select out the targets you are interested in, and the rest of the analysis can proceed as normal. Hope this helps.

      Brad Chapman

      August 4, 2010 at 6:03 am

  6. […] Gene Ontology analysis with Python and Bioconductor « Blue Collar Bioinformatics […]


Leave a comment