Blue Collar Bioinformatics

Automated retrieval of expression data with Python and R

with 6 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = "chapmanb@50mail.com"
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,
        _is_wild_type)

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    """
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)
                break

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """
    Entrez.email = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term

    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results = Entrez.read(handle)
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary = Entrez.read(handle)
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
                    break
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]),
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    """
    robjects.r.assign("gsm.id", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
    robjects.r('''
      library(GEOquery)
      library(rjson)
      gsm <- getGEO(gsm.id)
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)
    ''')

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    """
    robjects.r.assign("gpl.id", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
    robjects.r('''
      library(GEOquery)
      gpl <- getGEO(gpl.id)
      gpl.map <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
    ''')

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    """
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next() # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
                tx_to_vals[tx_id].append(float(probe_val))
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Gene Ontology analysis with Python and Bioconductor

leave a comment »

Last post we discussed extracting differentially expressed genes from a set of short read data. The output of this analysis is a large list of genes, and the next challenge is to try and extract interesting patterns from the results. Are genes in particular families over-represented? Are there a large number of genes with a particular function? These patterns can help direct your future experiments and provide an understandable grouping for biologists examining the data.

The Gene Ontology (GO) project provides a standardized set of terms describing the molecular function of genes. We will use the topGO package from the Bioconductor project to identify over-represented GO terms from a set of differentially expressed genes. Python will be used to prepare the data, utilizing rpy2 to call R for the statistical analysis.

The first step is to parse input files describing the differentially expressed genes and the mapping of gene names to GO terms. For the example we will use a simple CSV file from our previous analysis and an equally simple file describing the gene to GO mapping. For real life work, you will need to get the GO mapping file for your organism; for instance, here is the Arabidopsis GO mapping from TAIR.

The input file of differentially expressed genes is parsed into a dictionary where the keys are gene names and the values are associated p-values:

import csv

def parse_input_csv(in_handle):
    reader = csv.reader(in_handle)
    reader.next() # header
    all_genes = dict()
    for (gene_name, _, _, pval) in reader:
        all_genes[gene_name] = float(pval)
    return all_genes

The GO mapping input file is parsed into two dictionaries: one mapping genes to GO terms for the GO analysis, and the second mapping GO terms to genes for displaying the results of the analysis:

import collections

def parse_go_map_file(in_handle, genes_w_pvals):
    gene_list = genes_w_pvals.keys()
    gene_to_go = collections.defaultdict(list)
    go_to_gene = collections.defaultdict(list)
    for line in in_handle:
        parts = line.split("\t")
        gene_id = parts[0]
        go_id = parts[1].strip()
        if gene_id in gene_list:
            gene_to_go[gene_id].append(go_id)
            go_to_gene[go_id].append(gene_id)
    return dict(gene_to_go), dict(go_to_gene)

Next we set up our associated R environment for GO analysis. This involves loading the topGO library and creating a function to classify genes as either differentially expressed, or not. We do this based on the p-value, classifying genes with a p-value of 0.01 or less as differentially expressed:

import rpy2.robjects as robjects

gene_pval = 1e-2
robjects.r('''
    library(topGO)
''')
robjects.r('''
    topDiffGenes = function(allScore) {
      return (allScore < %s)
    }
''' % gene_pval)

Now it’s time to run the analysis. The paramters are defined as a dictionary which initializes a topGOdata object:

  • ontology — the type of GO ontology to analyze; molecular function (MF) is examined here.
  • annot — how the GO terms are annotated; we specify they will be identified by a mapping of genes to GO ids, passed as the gene2GO argument.
  • geneSelectionFun — the function we defined above to determine if genes are differentially expressed.
  • allGenes — an R named vector where the names are genes and the values are p-values; we use a utility function to convert a python dictionary to the named vector.
  • gene2GO — an R named vector mapping gene names to associated GO terms.

The initialized object is run using Fisher classic analysis. Several different analysis methods and test statistics are available in the package. The resulting scores are extracted and a summary table of the results is generated:

def _dict_to_namedvector(init_dict):
    """Call R to create a named vector from an input dictionary.
    """
    return robjects.r.c(**init_dict)

go_term_type = "MF"
topgo_method = "classic" # choice of classic, elim, weight
params = {"ontology" : go_term_type,
          "annot" : robjects.r["annFUN.gene2GO"],
          "geneSelectionFun" : robjects.r["topDiffGenes"],
          "allGenes" : _dict_to_namedvector(gene_vals),
          "gene2GO" : _dict_to_namedvector(gene_to_go)
          }
go_data = robjects.r.new("topGOdata", **params)
results = robjects.r.runTest(go_data, algorithm=topgo_method,
        statistic="fisher")
scores = robjects.r.score(results)
score_names = scores.getnames()
num_summarize = min(100, len(score_names))
results_table = robjects.r.GenTable(go_data, elimFisher=results,
        orderBy="elimFisher", topNodes=num_summarize)

The scores and results table are used to generate a list of over-represented GO terms with their associated p-values and descriptions:

# extract term names from the topGO summary dataframe
GO_ID_INDEX = 0
TERM_INDEX = 1
ids_to_terms = dict()
for index, go_id in enumerate(results_table[GO_ID_INDEX]):
    ids_to_terms[go_id] = results_table[TERM_INDEX][index]
go_terms = []
# convert the scores and results information info terms to return
for index, item in enumerate(scores):
    if item < go_pval:
        go_id = score_names[index]
        go_terms.append((item, go_id, ids_to_terms.get(go_id, "")))
go_terms.sort()

Finally, we print out the resulting overexpressed GO terms along with associated genes:

def print_go_info(go_terms, go_term_type, go_to_gene):
    for final_pval, go_id, go_term in go_terms:
        genes = []
        for check_go in [go_id] + get_go_children(go_id, go_term_type):
            genes.extend(go_to_gene.get(check_go, []))
        genes = sorted(list(set(genes)))
        print "-> %s (%s) : %0.4f" % (go_id, go_term, final_pval)
        for g in genes:
            print g

One tricky part of the post-analysis processing is finding all of the genes associated with an identified GO id. The GO terminology is hierarchical, so a resulting GO identifier like GO:0042578 in the example results file can represent both that term and more specific terms. A nice way to illustrate this for yourself is to look at the AmiGO browser display for the term.

The result is that genes associated with an over-represented term will come from both the identified term and more specific terms. Bioconductor has an interface to the GO database, so we can extract all of the relevant terms by moving down the GO tree collecting the more specific children, given the parent term:

def get_go_children(go_term, go_term_type):
    robjects.r('''
        library(GO.db)
    ''')
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        children.extend(new_children)
        to_check = new_children
    children = list(set(children))
    return children

The full script pulls all of this code together into a working example. You will need to update the parsers to match your input and GO mapping file, but otherwise the code can plug directly into full genome analyses of differential expression data.

Written by Brad Chapman

October 18, 2009 at 11:03 am

Differential expression analysis with Bioconductor and Python

with 6 comments

This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.

Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.

Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.

Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:

import csv
import collections

def read_count_file(in_file):
    """Read count information from a simple CSV file into a dictionary.
    """
    counts = collections.defaultdict(dict)
    with open(in_file) as in_handle:
        reader = csv.reader(in_handle)
        header = reader.next()
        conditions = header[1:]
        for parts in reader:
            region_name = parts[0]
            region_counts = [float(x) for x in parts[1:]]
            for ci, condition in enumerate(conditions):
                counts[condition][region_name] = region_counts[ci]
    return dict(counts)

The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:

  • groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
  • sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.

import numpy

def get_conditions_and_genes(work_counts):
    conditions = work_counts.keys()
    conditions.sort()
    all_genes = []
    for c in conditions:
        all_genes.extend(work_counts[c].keys())
    all_genes = list(set(all_genes))
    all_genes.sort()
    sizes = [work_counts[c]["Total"] for c in conditions]
    all_genes.remove("Total")
    return conditions, all_genes, sizes

def edger_matrices(work_counts):
    conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
    assert len(sizes) == 2
    groups = [1, 2]
    data = []
    final_genes = []
    for g in all_genes:
        cur_row = [int(work_counts[c][g]) for c in conditions]
        if sum(cur_row) > 0:
            data.append(cur_row)
            final_genes.append(g)
    return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
            conditions, final_genes)

The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList object, this object is passed to deDGE which does the actual digital gene expression analysis, and finally topTags is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.

import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

def run_edger(data, groups, sizes, genes):
    robjects.r('''
        library(edgeR)
    ''')
    params = {'group' : groups, 'lib.size' : sizes}
    dgelist = robjects.r.DGEList(data, **params)
    ms = robjects.r.deDGE(dgelist, doPoisson=True)
    tags = robjects.r.topTags(ms, pair=groups, n=len(genes))
    indexes = [int(t) - 1 for t in tags.rownames()]
    pvals = list(tags.r['adj.P.Val'][0])
    assert len(indexes) == len(pvals)
    pvals_w_index = zip(indexes, pvals)
    pvals_w_index.sort()
    assert len(pvals_w_index) == len(indexes)
    return [p for i,p in pvals_w_index]

The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:

def write_outfile(outfile, genes, conditions, work_counts, probs):
    with open(outfile, "w") as out_handle:
        writer = csv.writer(out_handle)
        writer.writerow(["Region"] +
                ["%s count" % c for c in conditions] + ["edgeR p-value"])
        out_info = []
        for i, gene in enumerate(genes):
            counts = [int(work_counts[c][gene]) for c in conditions]
            out_info.append((probs[i], [gene] + counts))
        out_info.sort()
        [writer.writerow(start + [prob]) for prob, start in out_info]

Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.

Written by Brad Chapman

September 13, 2009 at 8:28 pm

Usage plans for Amazon Web Services research grant

with 5 comments

Amazon Web Services provide an excellent distributed computing infrastructure through their Elastic Compute Cloud (EC2), Elastic Block Storage (EBS) and associated resources. Essentially, they make available on demand compute power and storage at prices that scale with usage. In the past I’ve written about using EC2 for parallel parsing of large files. Generally, I am a big proponent of distributed computing as a solution to dealing with problems ranging from job scaling to improving code availability.

One of the challenges in advocating for using EC2 at my day to day work is the presence of existing computing resources. We have servers and clusters, but how will we scale for future work? Thankfully, we are able to assess the utility of Amazon services for future scaling through their education and research grants. Our group applied and was accepted for a research grant which we plan to use to develop and distribute next generation sequencing analyses both within our group at Mass General Hospital and in the larger community.

Amazon Machine Images (AMIs) provide an opportunity for the open source bioinformatics community to increase code availability. AMIs are essentially pre-built operating systems with installed programs. By creating AMIs and making them available, a programmer can make their code readily accessible to users and avoid any of the intricacies of installation and configuration. Add this to available data in the form of public data sets and you have a ready to go analysis platform with very little overhead. There is already a large set of available AMIs from which to build.

This idea and our thoughts on moving portions of our next generation sequencing analysis to EC2 are fleshed out further in our research grant application, portions of which are included below. We’d love to collaborate with others moving their bioinformatics work to Amazon resources.

Research Background

One broad area of rapid growth in biology research is deep sequencing (or short read) technology. A single lab investigator can produce hundreds of millions of DNA sequences, equivalent in scale to the entire human genome, in a period of days. This DNA sequencing technology is widely available through both on-site facilities as well as through commercial services. Creating scalable analysis methods is a high priority for the entire bioinformatics community; see http://selab.janelia.org/people/eddys/blog/?p=123 for a presentation nicely summarizing the issues. We propose to address the computational bottlenecks resulting from this huge data volume using distributed AWS resources.

An additional aim of our work is to provide tools to biologists looking to solve their data analysis challenges. When the computational portion of a project becomes a time limiting step, we can often speed up the cycling between experiment and analysis by providing researchers with ready to run scripts or web interfaces. However, this is complicated by high usage on shared computational resources and heterogeneous platforms requiring time consuming configuration. Both problems could be ameliorated by scalable EC2 instances with custom configured machine images.

The goals of this grant application are to develop our analysis platform on Amazon’s compute cloud and assess transfer, storage and utilization costs. We currently have internal computational resources ranging from high performance clusters to large memory machines. We believe Amazon’s compute cloud to be an ideal solution as our analysis needs outgrow our current hardware.

Benefits to Amazon and the community

Developing software on AWS architecture presents a move towards a standard platform for bioinformatics research. Our group is invested in the open source community and shares both code and analysis tools. One common hindrance to sharing is the heterogeneity of platforms; code is developed on a local cluster and not readily generalizable, hence it is not shared.

By building public machine images along with reusable source code, a diverse variety of users can readily use our code and tools. As short read sequencing continues to increase in utility and popularity, a practical ready-to-go platform for analyses will encourage many users to adopt parallelization on cloud resources as a research approach. We have begun initial work with this paradigm by developing parsers for large annotation files using MapReduce on EC2.

Having the ability to utilize AWS with your support will help us further develop and disseminate analysis templates for the larger biology community, enabling science both at MGH and elsewhere.

Written by Brad Chapman

September 7, 2009 at 7:42 pm

Trimming adaptors from short read sequences

with 7 comments

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
            enumerate(seq_region))
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
    else:
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),
                right_side)

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    """
    if right_side:
        try:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
    else:
        try:
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)

    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Sorting genomic alignments using Python

leave a comment »

Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.

The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.

We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.

The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.

from bx.align import maf
from bx import interval_index_file

def build_index(in_file, index_file):
    indexes = interval_index_file.Indexes()
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec = reader.next()
            if rec is None:
                break
            for c in rec.components:
                indexes.add(c.src, c.forward_strand_start,
                        c.forward_strand_end, pos, max=c.src_size )

    with open(index_file, "w") as index_handle:
        indexes.write(index_handle)

Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.

rec_info = []
with open(in_file) as in_handle:
    reader = maf.Reader(in_handle)
    while 1:
        pos = reader.file.tell()
        rec = reader.next()
        if rec is None:
            break
        rec_info.append((rec.text_size, pos))
rec_info.sort(reverse=True)

Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.

index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
    writer = maf.Writer(out_handle)
    for size, pos in rec_info:
        rec = index.get_at_offset(pos)
        writer.write(rec)

The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.

Written by Brad Chapman

July 26, 2009 at 3:52 pm

Posted in OpenBio

Tagged with , ,

Thoughts from BOSC 2009; Python in bioinformatics

with 4 comments

I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This lead to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.

Python in Bioinformatics

One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.

This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.

What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each others platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curves to producing useful biology related code in Python.

Galaxy

My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first idea was taken very successfully by the Firefox browser with their plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.

A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.

The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons like backend, SQLAlchemy for databases interactivity, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.

We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.

I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.

Written by Brad Chapman

July 19, 2009 at 8:13 pm

Posted in OpenBio

Tagged with , ,

Talking at BOSC 2009 about publishing biological data on the web

with 3 comments

The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years, and giving a short and sweet 10 minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with The Data and Analysis Management special interest group, also full of interesting talks.

The talk will promote development of reusable web based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:

  • Motivation: rapidly organize and display biological data in a web accessible format.
  • Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
  • Current state: back end databases for storing biological data — BioSQL, GMOD
  • Current state: full featured web applications targeted at users — Galaxy, GBrowse
  • My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with larger community.
  • Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
  • Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
  • Framework: make use of front end javascript frameworks — jQuery, ExtJS.
  • Framework: make use of back end web frameworks — Pylons
  • Implementation: Demo server for displaying sequences plus annotations
  • Implementation: Utilizes BioSQL schema, ported to object oriented data store; Google App engine backend or MongoDB backend
  • Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
  • Implementation: Additional screenshots from internal web displays.
  • Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
  • Challenges: Re-usable components for cross-language functionality; javascript front end displays for multi-language back ends.
  • Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.

I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.

Written by Brad Chapman

June 11, 2009 at 7:41 am

Python libraries for synthetic biology

with 8 comments

Researchers today are empowered with the ability to devise DNA sequences and order these designs for fabrication. Starting with known pieces from biological systems, variations can be conceptualized and explored with increasing ease. As a result, the field of synthetic biology has expanded along with techniques used to provide larger pieces of synthesized DNA. Companies such as DNA 2.0 and GENEART provide on-demand synthesis services. Additionally, many researchers design, optimize and generate custom DNA sequences in their own lab.

My previous job was at a synthetic biology start up working on the software side of their fabrication and design services. The company was unfortunately a recent victim of the economy. The learning from five years of in the trenches synthesis work will be very useful to others tackling similar projects. With that in mind, the Python libraries used for construction, sequencing verification, and many other synthetic biology related tasks are being made available on Bitbucket.

Some highlights of the library:

The code in this library was integrated as part of an automated synthesis platform. As a result, it does not feature ready to run scripts. Rather, the code can serve as a source for developing these type of tools. I would love to see the most useful parts refactored and integrated into existing software communities like Biopython. Please drop a note if you have any interest in collaborating on such a project.

Written by Brad Chapman

May 31, 2009 at 9:28 pm

Evaluating key-value and document stores for short read data

with 18 comments

Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata.

A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. I had found myself embedding too much logic about database location and structure into my code while developing with Berkeley DB.

The main factors considered in evaluating the key/value and document stores for my needs were:

  • Network accessible
  • Python library support
  • Data loading time
  • Data query time
  • File storage space
  • Implementation of queries beyond key/value retrieval

Another relevant consideration, which is not as important to my work but might be to yours, is replicating and distributing a database across multiple servers. These are major issues when developing websites with millions of concurrent users; in science applications I am less likely to find that kind of popularity for short read SNP analysis.

Leonard had recently posted a summary of his experience evaluating distributed key stores, which served as an excellent starting point for looking at the different options out there. I decided to do an in-depth evaluation of three stores:

Tokyo Cabinet is a key/value store which received a number of excellent reviews for its speed and reliability. CouchDB and MongoDB are document oriented stores which offer additional query capabilities.

To evaluate the performance of these three stores, frequency counts for 2.8 million unique reads were loaded. Real stores would have additional details on each read, but the general idea is the same: a large number of relatively small documents. Each of the stores was accessed across the network on a remote machine. The 2.8 million reads were loaded, and then a half million records were retrieved from the database. The python scripts are available on github. The below table summarizes the results:

Database Load time Retrieval time File size
Tokyo Cabinet/Tyrant 12 minutes 3 1/2 minutes 24MB
CouchDB 5 1/2 minutes 14 1/2 minutes 236MB
MongoDB 3 minutes 4 minutes 192-960MB

For CouchDB, I initially reported large numbers which were improved dramatically with some small tweaks. With a naive loading strategy, times were in the range of 22 hours with large 6G files. Thanks to tips from Chris and Paul in the comments, the loading script was modified to use bulk loading. With this change, loading times and file sizes are in the range of the other stores and the new times are reflected in the table. There appear to also be some tweaks that can be made to favor speed over reliability; these tests were done with the standard configuration. The message here is to dig deeper if you find performance issues with CouchDB; small differences in usage can provide huge gains.

Based on the loading tests, I decided to investigate MongoDB and Tokyo Cabinet/Tyrant further. CouchDB loading speeds were improved dramatically by the changes mentioned above; however, retrieval speeds are about 3 times slower. Fetching single records from the database is the most important speed consideration for my work since it happens more frequently than loading, and reflects in the responsiveness of the web front end accessing the database. It is worth investigating whether client code changes can also speed up CouchDB. For Tokyo Cabinet/Tyrant and MongoDB the main performance trade-off was disk space usage for the database files. Tokyo Cabinet loads about 4 times slower, but maintains a much more compact file representation. To better understand how MongoDB stores file, I wrote to the community mailing list and received quick and thoughtful responses on the issue. See the full thread if you are interested in the details; in summary, MongoDB pre-allocated space to improve loading time and this allocation becomes less of an issue as the database size increases.

Looking beyond performance issues, Tokyo Cabinet/Tyrant and MongoDB represent two ends of the storage spectrum. MongoDB is a larger, full featured database providing complex query operations and management of multiple data stores. Tokyo Cabinet and Tyrant provide a lightweight solution for key/value retrieval. Each separate remote Tokyo Cabinet data store requires a Tyrant instance to be running. My work involves generating many different key/value databases for individual short read sequencing projects. To reasonably achieve this with remote Tyrant stores, I would need to develop a server that could start and manage Tyrant instances on demand. Additionally, if my query needs change the key/value paradigm of Tokyo Cabinet would require generating additional key values stores. For instance, we could not readily retrieve reads with a frequency greater than a defined threshold.

In conclusion, both Tokyo Cabinet/Tyrant and MongoDB proved to be excellent solutions for managing the volume and style of data associated with short read experiments. MongoDB provided additional functionality in the form of remote data store management and advanced queries which will be useful for my work; I’ll be switching my Berkeley DB stores over to MongoDB and continuing to explore its capabilities. I would welcome hearing about the solutions others have employed for similar storage and query issues.

Written by Brad Chapman

May 10, 2009 at 10:28 am

Posted in storage

Tagged with , ,