Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Posts Tagged ‘python

An automated ensemble method for combining and evaluating genomic variants from multiple callers

with 10 comments


A key goal of the Archon Genomics X Prize infrastructure is development of a set of highly accurate reference genome variants. I’ve described our work preparing these reference genomes, and specifically defined the challenges behind merging genomic variant calls from multiple technologies and calling methods. Comparing calls from two different calling methods, for example GATK and samtools mpileup, produces a large number of differing variants which need reconciliation. Taking the overlapping subset from multiple callers is too conservative and will miss real variations, while including all calls is too liberal and introduces false positives.

Here I’ll describe a fully automated approach for preparing an accurate set of combined variant calls. Ensemble machine learning methods are a powerful way to incorporate inputs from multiple models. We use a heuristic and support vector machine (SVM) algorithm to consolidate variants, producing a final set of calls with better sensitivity and specificity than current best practice methods. The approach is open source, fully automated and generalizable to both human diploid sequencing as well as X Prize haploid reference fosmids.

We use a pair of replicates from EdgeBio’s clinical exome sequencing pipeline to prepare ensemble variant calls in the widely studied HapMap NA12878 genome. Compared to variants from a single calling method, the ensemble method produced more concordant variants when comparing the replicates, with fewer discordants. The finalized ensemble calls also provide a useful method to compare strengths and weaknesses of individual calling methods. The implementation is freely available and I’ll discuss how to get it running on your data so you can use, critique and extend the methods. This work is a collaboration between Harvard School of Public Health, EdgeBio and NIST.

Comparison materials and algorithm

A difficult aspect of evaluating variant calling methods is establishing a reference set of calls. For X Prize we use three established methods, each of which comes with tradeoffs. Metrics like transition/transversion ratios or dbSNP overlap provide a global picture of calling but are not fine grained enough to distinguish improvements over best practices. Sanger validation restricts you to a manageable subset of calls. Comparisons against public resources like 1000 genomes bias results towards technologies and callers used in preparing those variant callsets.

Here we employ a fourth method by comparing replicates from EdgeBio’s clinical exome sequencing pipeline. These are NA12878 samples independently prepared using Nimblegen’s version 3.0 kit and sequenced on an Illumina HiSeq. By comparing the replicates in regions with 4 or more reads in both samples, we identify the ability of variant calling algorithms to call identical variations with differing coverage and error profiles.

We aligned reads with novoalign and performed deduplication, base recalibration and realignment using GATK best practices. With these prepared reads, we called variants with five approaches:

  • GATK UnifiedGenotyper – Bayesian approach to call SNPs and indels, treating each position independently.
  • GATK HaplotypeCaller – Performs local de-novo assembly to call SNPs and indels on individual haplotypes.
  • FreeBayes – Bayesian calling approach that handles simultaneous SNPs and indel calling via assessment of regional haplotypes.
  • samtools mpileup – Uses an approach similar to GATK’s UnifiedGenotyper for SNP and indel calling.
  • VarScan – Calls variants using a heuristic/statistic approach eliminating common sources of bias.

We took a combined heuristic and machine learning approach to consolidate these five sets of variant calls into a final ensemble callset. The first step is to prepare the union of all variant calls from the input callers, identifying calling methods that support each variant. Secondly, we annotate each variant with metrics including strand bias, allele balance, regional sequence entropy, position of calls within reads, regional base quality and overall genotype likelihoods. We then filter this prepared set of all possible variants to produce a final set of trusted calls.

The first filtering step is to heuristically identify trusted variants based on the number of callers supporting them. This configurable parameter allow you to make an initial conservative cutoff for including variants in the final calls: I trust variants supported by N or more callers.

For the remaining calls that fall below the trusted support cutoff, we distinguish true and false positives using a support vector machine (SVM). The annotated metrics described above are the input parameters and we prepare true and false positives for the classifier using a multi-step process:

  • Use variants found in all callers as the true positive set, and variants found in a single caller as false positives. Use these training variants to identify an initial set of below-cutoff variants to include and exclude.
  • With this initial set of below-cutoff true/false variants, re-train multiple classifiers stratified based on variant characteristics: variant type (indels vs SNPs), zygosity (heterozygous vs homozygous) and regional sequence complexity.
  • Use these final classifiers to identify included and excluded variants falling below the trusted calling support cutoff.

The final set of calls includes the trusted variants and those that pass the SVM filtering. An input configuration file for variant preparation and assessment allows adjustment of the trusted threshold as well as defining which metrics to use for building the SVM classifiers.

Ensemble calling improvements

We assess calling sensitivity and specificity by comparing concordant and discordant variant calls between the replicates. To provide a consistent method to measure both SNP and indel correctness, we use the positive predictive value as the percentage of concordant calls between duplicates (concordant variants / (concordant variants + discordant variants)). This is different than the overall concordance rate, which also includes non-variant regions where both replicates do not call a variation. As a result these percentages will be lower if you expect the 99% values that result when including reference calls. The advantage of this metric is that it’s easily interpreted as the percentage of concordant called variants. It also allows individual comparisons of SNPs and indels, since the overall number of indels are low compared to the total bases considered. GATK’s VariantEval documentation has a nice discussion of alternative metrics to genotype concordance.

As a baseline we used calls from GATK’s UnifiedGenotyper to represent a current best practice approach. GATK calls 117079 SNPs, 86.6% of which are concordant. It also calls 14966 indels, with 64.6% concordant. Here are the full concordant and discordant numbers, broken down by variant type and replicate:

concordant: total 111159
concordant: SNPs 101495
concordant: indels 9664
rep1 discordant: total 9857
rep1 discordant: SNPs 7514
rep1 discordant: indels 2343
rep2 discordant: total 11029
rep2 discordant: SNPs 8070
rep2 discordant: indels 2959
het/hom discordant 4181

Our ensemble method produces improvements in both total concordant variants detected and the ratio of concordant to discordants. For SNPs, the ensemble calls add 5345 additional variants to a total of 122424, with an increase in concordance to 87.4%. For indels the major improvement is in removal of discordants: We identify 14184 indels with 67.2% concordant. Here is the equivalent table for the ensemble method:

concordant: total 116608
concordant: SNPs 107063
concordant: indels 9545
rep1 discordant: total 9555
rep1 discordant: SNPs 7581
rep1 discordant: indels 1974
rep2 discordant: total 10445
rep2 discordant: SNPs 7780
rep2 discordant: indels 2665
het/hom discordant 3975

For scientists who have worked to increase sensitivity and specificity of individual variant callers, it’s exciting to be able to improve both simultaneously. As mentioned above, you can also tune the method to increase specificity or sensitivity by adjusting the support needed for including trusted variants.

The final ensemble callsets from both replicates are available as VCF files from GenomeSpace in the xprize/NA12878-exome-v_03 folder:

Comparison of calling methods

Calling the same samples with multiple callers allows direct comparisons between calling methods. The advantage of producing an accurate final set of ensemble calls is that this provides a baseline to evaluate the strengths and weaknesses of different calling methods. The figure below compares concordant, missing variants and additional variants called by each of the 5 methods in comparison with the consolidated ensemble calls:

Concordance/discordance for calling methods

  • GATK UnifiedGenotyper provides the best SNP calling, followed closely by samtools mpileup.
  • For indel calling, the GATK HaplotypeCaller produces the most concordant calls followed by UnifiedGenotyper and FreeBayes. UnifiedGenotyper does good as well, but is conservative and has the fewest additional indels. FreeBayes and GATK HaplotypeCaller both provide resolution of individual haplotypes which helps in regions with heterozygous indels or closely spaced SNPs and indels.
  • If you want to use a single variant caller, GATK UnifiedGenotyper does the best overall job.
  • If you wanted to choose free open-source tools for calling, I would recommend samtools for SNP calling and FreeBayes for indel calling.

Variant calling methods with recommendations for both calling and filtering provide the best out of the box performance. An advantage of GATK and samtools is they provide calling, variant quality metrics, and filtering. On the other side, FreeBayes is a good example of a powerful tool that takes some time to learn to filter optimally. One potential source of bias in producing the individual calls is that I personally have more experience with GATK tools so may have room to improve with the other callers.

Availability and usage

Combining multiple calling approaches improves both sensitivity and specificity of the final set of variants. The downside is the need to run and coordinate calls from all of the different callers. To mitigate this, we developed an automated pipeline that ties together multiple open-source tools using two custom components:

  • bcbio-nextgen – A Python framework to run a full sequencing analysis pipeline from input fastq files to consolidated ensemble variant calls. It supports multiple aligners and variant callers, and enables distributed work over multiple cores on a large machine or multiple machines in a cluster environment.
  • bcbio.variation – A Clojure toolkit build on top of GATK’s variant API that provides ensemble call preparation as well as more general functionality for normalizing and comparing variants produced by multiple callers.

bcbio-nextgen has a script, built on functionality in the CloudBioLinux project, that automates installation of associated variant callers and data dependencies:

python install_directory data_directory

With the dependencies installed, you describe the input files and analysis with a YAML formatted input file. The NA12878 ensemble configuration file used for this analysis provides a useful starting point. Run the analysis, distributed on multiple cores, with: bcbio_system.yaml ensemble_sample.yaml -n 8

The bcbio-nextgen documentation provides additional details about configuration inputs and distributed processing. The framework generally handles the automation and processing involved with high throughput sequencing analysis.

EdgeBio kindly made the NA12878 datasets used in this analysis publicly available:

I welcome feedback on the approach, data or tools and am actively working to extend this and make it easier to use. As re-sequencing becomes increasingly important for human health applications it is critical that we develop open, shared best-practice workflows to handle the data processing. This allows us to focus back on the fun and difficult work of understanding the biology.

Written by Brad Chapman

February 6, 2013 at 7:25 am

Distributed exome analysis pipeline with CloudBioLinux and CloudMan

with 19 comments

A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.

Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:

  • CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.

  • CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.

Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:

  • The underlying analysis software is from CloudBioLinux.
  • CloudMan provides an SGE cluster managed via a web front end.
  • RabbitMQ is used for communication between cluster nodes.
  • An automated pipeline, written in Python, organizes parallel processing across the cluster.

Below are instructions for starting a cluster on Amazon EC2 resources to run an exome sequencing pipeline that processes FASTQ sequencing reads, producing fully annotated variant calls.

Start cluster with CloudBioLinux and CloudMan

Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:

Once this is booted, proceed to the CloudMan web interface on the server and startup an instance from this shared identifier:


This screencast shows all of the details, including starting an additional node on the SGE cluster:

Configure AMQP messaging

Edit: The AMQP messaging steps have now been full automated so the configuration steps in this section are no longer required. Skip down to the ‘Run Analysis’ section to start processing the data immediately.

With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:

$ ssh -i your-keypair

Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:

host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester

Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:

$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...

Run analysis

With messaging in place, we are ready to run the analysis. /export/data contains a ready to run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:

 $ cd /export/data/work
 $ /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/config/run_info.yaml starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:

  • Alignment with BWA
  • Preparation of merged alignment files with Picard
  • Recalibration and realignment with GATK
  • Variant calling with GATK
  • Assessment of predicted variant effects with snpEff
  • Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard

Monitor the running process

The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGEs qstat command will tell you where the analysis servers are running on the cluster:

$ qstat
ob-ID  prior   name   user  state submit/start at   queue
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47

Listing files in the working directory will show our progress:

$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp

The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:

$ less
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK

Retrieve results

The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:

$ /export/data/galaxy/post_process.yaml

For each sample, this script copies:

  • A BAM file with aligned sequeneces and original FASTQ data
  • A realigned and recalibrated BAM file, ready for variant calling
  • Variant calls in VCF format.
  • A tab delimited file of predicted variant effects.
  • A PDF summary file containing alignment, variant calling and hybrid selection statistics.

into an output directory for the flowcell: /export/data/galaxy/storage/100326_FC6107FAAXX:

$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd

As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.

Learn more

All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.

The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.

If you’d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I’ll be discussing it at the AWS Genomics Event in Seattle on September 22nd.

Written by Brad Chapman

August 19, 2011 at 5:33 pm

Parallel upload to Amazon S3 with python, boto and multiprocessing

with 41 comments

One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores.

Amazon Simple Storage System (S3) provides relatively inexpensive cloud storage with their reduced redundancy storage option. S3, and all of Amazon’s cloud services, are accessible directly from Python using boto. By using boto’s multipart upload support, coupled with Python’s built in multiprocessing module, I’ll demonstrate maximizing transfer speeds to make uploading data less painful. The script is available from GitHub and requires the latest boto from GitHub (2.0b5 or better).

Parallel upload with multiprocessing

The overall process uses boto to connect to an S3 upload bucket, initialize a multipart transfer, split the file into multiple pieces, and then upload these pieces in parallel over multiple cores. Each processing core is passed a set of credentials to identify the transfer: the multipart upload identifier (, the S3 file key name (mp.key_name) and the S3 bucket name (mp.bucket_name).

import boto

conn = boto.connect_s3()
bucket = conn.lookup(bucket_name)
mp = bucket.initiate_multipart_upload(s3_key_name, reduced_redundancy=use_rr)
with multimap(cores) as pmap:
    for _ in pmap(transfer_part, ((, mp.key_name, mp.bucket_name, i, part)
                                  for (i, part) in
                                  enumerate(split_file(tarball, mb_size, cores)))):

The split_file function uses the unix split command to divide the file into sections, each of which will be uploaded separately.

def split_file(in_file, mb_size, split_num=5):
    prefix = os.path.join(os.path.dirname(in_file),
                          "%sS3PART" % (os.path.basename(s3_key_name)))
    split_size = int(min(mb_size / (split_num * 2.0), 250))
    if not os.path.exists("%saa" % prefix):
        cl = ["split", "-b%sm" % split_size, in_file, prefix]
    return sorted(glob.glob("%s*" % prefix))

The multiprocessing aspect is managed using a contextmanager. The initial multiprocessing pool is setup, using a specified number of cores, and configured to allow keyboard interrupts. We then return a lazy map function (imap) which can be used just like Python’s standard map. This transparently divides the function calls for each file part over all available cores. Finally, the pool is cleaned up when the map is finished running.

def multimap(cores=None):
    if cores is None:
        cores = max(multiprocessing.cpu_count() - 1, 1)
    def wrapper(func):
        def wrap(self, timeout=None):
            return func(self, timeout=timeout if timeout is not None else 1e100)
        return wrap = wrapper(
    pool = multiprocessing.Pool(cores)
    yield pool.imap

The actual work of transferring each portion of the file is done using two functions. The helper function, mp_from_ids, uses the id information about the bucket, file key and multipart upload id to reconstitute a multipart upload object:

def mp_from_ids(mp_id, mp_keyname, mp_bucketname):
    conn = boto.connect_s3()
    bucket = conn.lookup(mp_bucketname)
    mp = boto.s3.multipart.MultiPartUpload(bucket)
    mp.key_name = mp_keyname = mp_id
    return mp

This object, together with the number of the file part and the file itself, are used to transfer that section of the file. The file part is removed after successful upload.

def transfer_part(mp_id, mp_keyname, mp_bucketname, i, part):
    mp = mp_from_ids(mp_id, mp_keyname, mp_bucketname)
    print " Transferring", i, part
    with open(part) as t_handle:
        mp.upload_part_from_file(t_handle, i+1)

When all sections, distributed over all processors, are finished, the multipart upload is signaled complete and Amazon finishes the process. Your file is now available on S3.

Parallel download

Download speeds can be maximized by utilizing several existing parallelized accelerators:

Combine these with the uploader to build up a cloud analysis workflow: move your data to S3, run a complex analysis pipeline on EC2, push the results back to S3, and then download them to local machines. Please share other tips and tricks you use to deal with Amazon file transfer in the comments.

Written by Brad Chapman

April 10, 2011 at 1:27 pm

Posted in analysis

Tagged with , , ,

Python query interface to BioGateway SPARQL endpoint and InterMine

with one comment

I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meet up of open source coders organized by Toshiaki Katayama and the great folks at the Database Center of Life Sciences (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.

The technical focus was on RDF (Resource Description Format), a standard way to represent data as a triple; each triple is made up of a subject defining the item, a predicate specifying a relationship, and a object representing the linked data. By providing data in RDF along with common naming schemes for common objects we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.

RDF is managed in data stores like Virtuoso, which are equivalent to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy to use Python library to query RDF stores through SPARQL.

Interface design

The python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface simplifies the process of making a query to two steps:

  • Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
  • Pass the query builder to a retrieval server which executes the query and returns the results in a tabular format, as a numpy RecordArray.

The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.

Two interfaces were implemented at BioHackathon 2010:

  • BioGateway — a SPARQL endpoint to a RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
  • InterMine — a XML based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.

By providing a common interface for both semantic data and more traditional data sources, we hope to facilitate the conversion by data providers to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing resources on developing their biological queries.


BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.

This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:

from systemsbio import Biogateway, UniProtGOQueryBuilder
builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
builder.add_attributes(["protein_name", "interactor", "gene_name"])
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_name'], result['gene_name'], \
      result['interactor'], result['GO_term']

An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:

from systemsbio import Biogateway, ReferenceBuilder
builder = ReferenceBuilder()
builder.add_filter("protein_id", "1433B_HUMAN")
server = Biogateway()
results =
print len(results), results.dtype.names
result = results[0]
print result['protein_id'], result['reference']

Using the associated PubMed IDs, we can retrieve the paper details using Biopython and NCBI Entrez, utilizing BioGateway as part of an automated analysis pipeline:

from Bio import Entrez = ""
pubmed_id = result['reference'].replace("PMID_", "")
handle = Entrez.esummary(db="pubmed", id=pubmed_id)
record =[0]
print record['Title']
print record['PubDate']
print ",".join(record['AuthorList'])
print record['FullJournalName'], record['Volume'], record['Pages']
# Novel raf kinase protein-protein interactions found by an exhaustive yeast two-hybrid analysis.
# 2003 Feb
# Yuryev A,Wennogle LP
# Genomics 81 112-25

The full source code is available from GitHub: The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLwrapper to interact with the remote server and parse the results.


InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:

  • Templates — Pre-defined queries that capture common ways biologists search the database.

  • Query builder — A graphical interface to define custom queries, allowing manual discovery of attributes of interest.

We access InterMine programatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4Me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:

from intermine import Intermine, SubmissionQueryBuilder
builder = SubmissionQueryBuilder()
    "submission_title", "developmental_stage"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.add_filter("antibody_name", "H3K4me3")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'submission_title', 'developmental_stage')
# [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
# ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
# 'Early Stage Embryos')
# ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
# 'Early Stage Embryos')]

An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:

from intermine import Intermine, ExperimentQueryBuilder
builder = ExperimentQueryBuilder()
builder.add_attributes(["submission_id", "experiment_name"])
builder.add_filter("organism", "Caenorhabditis elegans")
server = Intermine("")
table =
print table.dtype.names
print table
# ('submission_id', 'experiment_name')
# [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ...

The full source code is available from GitHub: It builds a XML query for submission to InterMine Web Services, handling submitting and parsing the result.


It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day to day summaries:

You are invited to fork and extend the code on GitHub.

Written by Brad Chapman

February 15, 2010 at 11:33 am

Automated retrieval of expression data with Python and R

with 10 comments

In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = ""
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """ = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results =
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary =
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    robjects.r.assign("", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
      gsm <- getGEO(
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    robjects.r.assign("", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
      gpl <- getGEO( <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab') # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Gene Ontology analysis with Python and Bioconductor

with 9 comments

Last post we discussed extracting differentially expressed genes from a set of short read data. The output of this analysis is a large list of genes, and the next challenge is to try and extract interesting patterns from the results. Are genes in particular families over-represented? Are there a large number of genes with a particular function? These patterns can help direct your future experiments and provide an understandable grouping for biologists examining the data.

The Gene Ontology (GO) project provides a standardized set of terms describing the molecular function of genes. We will use the topGO package from the Bioconductor project to identify over-represented GO terms from a set of differentially expressed genes. Python will be used to prepare the data, utilizing rpy2 to call R for the statistical analysis.

The first step is to parse input files describing the differentially expressed genes and the mapping of gene names to GO terms. For the example we will use a simple CSV file from our previous analysis and an equally simple file describing the gene to GO mapping. For real life work, you will need to get the GO mapping file for your organism; for instance, here is the Arabidopsis GO mapping from TAIR.

The input file of differentially expressed genes is parsed into a dictionary where the keys are gene names and the values are associated p-values:

import csv

def parse_input_csv(in_handle):
    reader = csv.reader(in_handle) # header
    all_genes = dict()
    for (gene_name, _, _, pval) in reader:
        all_genes[gene_name] = float(pval)
    return all_genes

The GO mapping input file is parsed into two dictionaries: one mapping genes to GO terms for the GO analysis, and the second mapping GO terms to genes for displaying the results of the analysis:

import collections

def parse_go_map_file(in_handle, genes_w_pvals):
    gene_list = genes_w_pvals.keys()
    gene_to_go = collections.defaultdict(list)
    go_to_gene = collections.defaultdict(list)
    for line in in_handle:
        parts = line.split("\t")
        gene_id = parts[0]
        go_id = parts[1].strip()
        if gene_id in gene_list:
    return dict(gene_to_go), dict(go_to_gene)

Next we set up our associated R environment for GO analysis. This involves loading the topGO library and creating a function to classify genes as either differentially expressed, or not. We do this based on the p-value, classifying genes with a p-value of 0.01 or less as differentially expressed:

import rpy2.robjects as robjects

gene_pval = 1e-2
    topDiffGenes = function(allScore) {
      return (allScore < %s)
''' % gene_pval)

Now it’s time to run the analysis. The paramters are defined as a dictionary which initializes a topGOdata object:

  • ontology — the type of GO ontology to analyze; molecular function (MF) is examined here.
  • annot — how the GO terms are annotated; we specify they will be identified by a mapping of genes to GO ids, passed as the gene2GO argument.
  • geneSelectionFun — the function we defined above to determine if genes are differentially expressed.
  • allGenes — an R named vector where the names are genes and the values are p-values; we use a utility function to convert a python dictionary to the named vector.
  • gene2GO — an R named vector mapping gene names to associated GO terms.

The initialized object is run using Fisher classic analysis. Several different analysis methods and test statistics are available in the package. The resulting scores are extracted and a summary table of the results is generated:

def _dict_to_namedvector(init_dict):
    """Call R to create a named vector from an input dictionary.
    return robjects.r.c(**init_dict)

go_term_type = "MF"
topgo_method = "classic" # choice of classic, elim, weight
params = {"ontology" : go_term_type,
          "annot" : robjects.r["annFUN.gene2GO"],
          "geneSelectionFun" : robjects.r["topDiffGenes"],
          "allGenes" : _dict_to_namedvector(gene_vals),
          "gene2GO" : _dict_to_namedvector(gene_to_go)
go_data ="topGOdata", **params)
results = robjects.r.runTest(go_data, algorithm=topgo_method,
scores = robjects.r.score(results)
score_names = scores.getnames()
num_summarize = min(100, len(score_names))
results_table = robjects.r.GenTable(go_data, elimFisher=results,
        orderBy="elimFisher", topNodes=num_summarize)

The scores and results table are used to generate a list of over-represented GO terms with their associated p-values and descriptions:

# extract term names from the topGO summary dataframe
ids_to_terms = dict()
for index, go_id in enumerate(results_table[GO_ID_INDEX]):
    ids_to_terms[go_id] = results_table[TERM_INDEX][index]
go_terms = []
# convert the scores and results information info terms to return
for index, item in enumerate(scores):
    if item < go_pval:
        go_id = score_names[index]
        go_terms.append((item, go_id, ids_to_terms.get(go_id, "")))

Finally, we print out the resulting overexpressed GO terms along with associated genes:

def print_go_info(go_terms, go_term_type, go_to_gene):
    for final_pval, go_id, go_term in go_terms:
        genes = []
        for check_go in [go_id] + get_go_children(go_id, go_term_type):
            genes.extend(go_to_gene.get(check_go, []))
        genes = sorted(list(set(genes)))
        print "-> %s (%s) : %0.4f" % (go_id, go_term, final_pval)
        for g in genes:
            print g

One tricky part of the post-analysis processing is finding all of the genes associated with an identified GO id. The GO terminology is hierarchical, so a resulting GO identifier like GO:0042578 in the example results file can represent both that term and more specific terms. A nice way to illustrate this for yourself is to look at the AmiGO browser display for the term.

The result is that genes associated with an over-represented term will come from both the identified term and more specific terms. Bioconductor has an interface to the GO database, so we can extract all of the relevant terms by moving down the GO tree collecting the more specific children, given the parent term:

def get_go_children(go_term, go_term_type):
    child_map = robjects.r["GO%sCHILDREN" % (go_term_type)]
    children = []
    to_check = [go_term]
    while len(to_check) > 0:
        new_children = []
        for check_term in to_check:
            new_children.extend(list(robjects.r.get(check_term, child_map)))
        new_children = list(set([c for c in new_children if c]))
        to_check = new_children
    children = list(set(children))
    return children

The full script pulls all of this code together into a working example. You will need to update the parsers to match your input and GO mapping file, but otherwise the code can plug directly into full genome analyses of differential expression data.

Written by Brad Chapman

October 18, 2009 at 11:03 am

Differential expression analysis with Bioconductor and Python

with 10 comments

This post demonstrates performing differential expression analysis of short read sequencing data using a combination of Python and the R statistical language. Python is used a glue language to manipulate and prepare count data from short read sequencing. R and the Bioconductor package are used to perform the statistical analysis. The excellent rpy2 package connection Python and R.

Scientists often need to compare expression results from multiple experimental conditions. This can be done with microarray analysis and the wide variety of associated statistical tests, but many researchers are now utilizing short read sequencing. Experiments identify transcript prevalence through digital gene expression under multiple conditions, and we are interested in extracting statistically meaningful differences from the results. Similarly, we may be interested in differences resulting from short RNA discovery or other next generation sequencing applications.

Short read differential expression analysis differs from microarray analyses as it is based on counts instead of intensity scores. The edgeR package in Bioconductor fits count data to a statistical model considering sample to sample variation, and performs Fisher’s exact test to provide p-values associated with changes in expression between samples.

Here we will consider the case of a two experiment analysis. A simple Example CSV file has counts for 5 different genes under 2 conditions and total count information for the full experiment. We read the file into a dictionary of counts keyed by each condition and the gene name:

import csv
import collections

def read_count_file(in_file):
    """Read count information from a simple CSV file into a dictionary.
    counts = collections.defaultdict(dict)
    with open(in_file) as in_handle:
        reader = csv.reader(in_handle)
        header =
        conditions = header[1:]
        for parts in reader:
            region_name = parts[0]
            region_counts = [float(x) for x in parts[1:]]
            for ci, condition in enumerate(conditions):
                counts[condition][region_name] = region_counts[ci]
    return dict(counts)

The dictionary is organized into NumPy matrices; NumPy is a powerful numerical package for Python which integrates smoothly with our rpy2 interface to import directly into R. Here we organize our conditions and genes and then push the count data into a matrix where the columns are conditions and the rows are genes; each item is a count value. This is returned along with associated data:

  • groups — A list of experimental groups, which can be used to analyze replicates. Here, two groups with a single replicate each are defined.
  • sizes — The total number of counts for each experiment. This is extracted from the “Total” row of the CSV count file, and will be used for normalization of during the statistical analysis.

import numpy

def get_conditions_and_genes(work_counts): 
    conditions = work_counts.keys()
    all_genes = []
    for c in conditions:
    all_genes = list(set(all_genes))
    sizes = [work_counts[c]["Total"] for c in conditions]
    return conditions, all_genes, sizes
def edger_matrices(work_counts):
    conditions, all_genes, sizes = get_conditions_and_genes(work_counts)
    assert len(sizes) == 2
    groups = [1, 2]
    data = []
    final_genes = []
    for g in all_genes:
        cur_row = [int(work_counts[c][g]) for c in conditions]
        if sum(cur_row) > 0:
    return (numpy.array(data), numpy.array(groups), numpy.array(sizes),
            conditions, final_genes)

The organized data is now ready to be pushed directly into an edgeR Bioconductor analysis using the rpy2 interface. Three R functions are called: the data matrices are organized into a DGEList object, this object is passed to deDGE which does the actual digital gene expression analysis, and finally topTags is called to retrieve a vector of differential expression p-values. The vector is translated into a Python object from which we extract the p-values and re-organize them by the initial gene indexes. The ordered p-values are then returned.

import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

def run_edger(data, groups, sizes, genes):
    params = {'group' : groups, 'lib.size' : sizes}
    dgelist = robjects.r.DGEList(data, **params)
    ms = robjects.r.deDGE(dgelist, doPoisson=True)
    tags = robjects.r.topTags(ms, pair=groups, n=len(genes))
    indexes = [int(t) - 1 for t in tags.rownames()]
    pvals = list(tags.r['adj.P.Val'][0])
    assert len(indexes) == len(pvals)
    pvals_w_index = zip(indexes, pvals)
    assert len(pvals_w_index) == len(indexes)
    return [p for i,p in pvals_w_index]

The final results are written to a CSV file with our ordered genes, conditions and p-value probabilities. Each gene is associated with the initial count data and p-values, the genes are sorted by p-value with the most differentially expressed genes placed first, and the ordered information is written out:

def write_outfile(outfile, genes, conditions, work_counts, probs):
    with open(outfile, "w") as out_handle:
        writer = csv.writer(out_handle)
        writer.writerow(["Region"] +
                ["%s count" % c for c in conditions] + ["edgeR p-value"])
        out_info = []
        for i, gene in enumerate(genes):
            counts = [int(work_counts[c][gene]) for c in conditions]
            out_info.append((probs[i], [gene] + counts))
        [writer.writerow(start + [prob]) for prob, start in out_info]

Our example output file shows the results. The full script is available for you to use and customize for your own analyses. This example demonstrates the power of the rpy2 interface. Input and output data is happily manipulated in a familiar language such as Python, and can be seamlessly integrated with excellent statistical computation in Bioconductor and other specialized R packages.

Written by Brad Chapman

September 13, 2009 at 8:28 pm

Trimming adaptors from short read sequences

with 11 comments

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    if right_side:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Tagged with , , ,

Sorting genomic alignments using Python

leave a comment »

Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.

The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.

We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.

The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.

from bx.align import maf
from bx import interval_index_file

def build_index(in_file, index_file):
    indexes = interval_index_file.Indexes()
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec =
            if rec is None:
            for c in rec.components:
                indexes.add(c.src, c.forward_strand_start,
                        c.forward_strand_end, pos, max=c.src_size )

    with open(index_file, "w") as index_handle:

Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.

rec_info = []
with open(in_file) as in_handle:
    reader = maf.Reader(in_handle)
    while 1:
        pos = reader.file.tell()
        rec =
        if rec is None:
        rec_info.append((rec.text_size, pos))

Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.

index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
    writer = maf.Writer(out_handle)
    for size, pos in rec_info:
        rec = index.get_at_offset(pos)

The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.

Written by Brad Chapman

July 26, 2009 at 3:52 pm

Posted in OpenBio

Tagged with , ,

Thoughts from BOSC 2009; Python in bioinformatics

with 6 comments

I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This lead to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.

Python in Bioinformatics

One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.

This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.

What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each others platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curves to producing useful biology related code in Python.


My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first idea was taken very successfully by the Firefox browser with their plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.

A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.

The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons like backend, SQLAlchemy for databases interactivity, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.

We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.

I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.

Written by Brad Chapman

July 19, 2009 at 8:13 pm

Posted in OpenBio

Tagged with , ,