Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/ Please look there for the latest updates and comments

Automated retrieval of expression data with Python and R

with 10 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = "chapmanb@50mail.com"
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,
        _is_wild_type)

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    """
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)
                break

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """
    Entrez.email = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results = Entrez.read(handle)
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary = Entrez.read(handle)
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
                    break
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    """
    robjects.r.assign("gsm.id", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
    robjects.r('''
      library(GEOquery)
      library(rjson)
      gsm <- getGEO(gsm.id)
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)
    ''')

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    """
    robjects.r.assign("gpl.id", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
    robjects.r('''
      library(GEOquery)
      gpl <- getGEO(gpl.id)
      gpl.map <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
    ''')

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    """
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next() # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
                tx_to_vals[tx_id].append(float(probe_val))
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

10 Responses

Subscribe to comments with RSS.

  1. Brad,

    Good to see you’re still posting. I’ve found some of your code to be quite useful. Thanks!

    – Nick

    Nick

    January 5, 2010 at 8:44 am

  2. Once more an interesting demonstration of how some tools can be bridged.

    This is also making me notice that some features in rpy2 are probably not as well documented as they should.

    The use of a snippets of R code is perfectly fine,
    however the option to move the code logic to Python exists (can make it easier for debugging with pdb,
    allows the use of anonymous R variable that do not stay in R’s “globalenv”).

    (note: the following code is taking advantage of the features in rpy2-2.1.0, but the same idea could be used with 2.0.x…)

    from rpy2.robjects.packages import importr
    utils    = importr("utils")
    geoquery = importr("GEOquery")
    rjson    = importr("rjson")
    
    def _write_gsm_map(self, gsm_id, 
                       meta_file, table_file):
        """Retrieve GEO expression values using Bioconductor, saving to a table and a jason file.
        """
        gsm = geoquery.getGEO(gsm_id)
        utils.write_table(geoquery.Table(gsm),
                          file = table_file,
                          sep = "\t",
                          row_names = False,
                          col_names = True)
        json_data = rjson.toJSON(geoquery.Meta(gsm))
        fh = file(meta_file, "w")
        fh.write(json_data)
        fh.close()
    
    

    One can as also return “json_data” rather than save it into a file then read it immediately from the file.

    L.

    Laurent

    January 6, 2010 at 2:17 am

    • Laurent;
      Thanks for the more pythonic example — that’s really useful. Personally, I go back and forth on whether I prefer that way or the code snippets approach. Moving into snippets has been helping me practice my R skills, but does require the reader to feel comfortable with both Python and R syntax.

      Thanks also for all your hard work on rpy. 2.1.0 is awesome; I love the ggplot2 integration.

      Brad

      Brad Chapman

      January 6, 2010 at 7:49 am

  3. Hi Brad,

    I’m a post-doc in George Church’s lab at Harvard; I’ve been trying to contact you using your address in your code repositories (the synbio package) to no avail. Could you drop me a line when you get a chance.

    Thanks.

    Sri Kosuri

    February 2, 2010 at 3:22 pm

    • Similar issue for me. I had something to post off-blog, but had a hard time finding an email address.

      lgautier

      February 7, 2010 at 10:44 am

  4. Hey man just saying thanks for the great article it was very useful to what I’m doing at the moment.

    Paul

    February 9, 2010 at 3:43 pm

  5. Hi,

    This is indeed a very nice little script you have made here.
    Actually, I wrote a similar one based on yours, that does not depend on Rpy2 and the GEOquery package. and is purely in python. could that be something for BioPython?

    Nicolas Rapin

    March 22, 2010 at 5:15 am

  6. Just FYI there is a great download of the meta information from GEO @ this web site:

    http://gbnci.abcc.ncifcrf.gov/geo/

    Like your python implementation. ;-)

    Peace,
    Chris

    Chris Plaisier

    September 3, 2010 at 7:00 pm


Leave a comment