Blue Collar Bioinformatics

Note: new posts have moved to Please look there for the latest updates and comments

Archive for the ‘OpenBio’ Category

Summary from Bioinformatics Open Science Codefest 2013: Tools, infrastructure, standards and visualization

with 3 comments

The 2013 Bioinformatics Open Source Conference (BOSC) starts tomorrow in Berlin, Germany. It’s a yearly conference devoted to community-based software development projects supporting biological research. Members of the Open Bioinformatics Foundation discuss implementations and approaches to better provide interoperable and reusable software, libraries and pipelines.

For the past five years, a two day Codefest and hackathon preceded the conference. This gives programmers time to work face-to-face, sharing approaches and discovering connections between projects. This year, the the Department of Biology, Humboldt-Universität zu Berlin kindly hosted Codefest 2013. Thanks to the organizers and attendees, we finished projects ranging from tool development, infrastructure integration, standards development and visualization. There are photos of the Codefest in progress and a detailed writeup of projects.

Below we summarize the accomplishments from the two days. We welcome feedback on the topics covered and hope that by sharing our work we can encourage more programmers to become part of the open science bioinformatics community. Actively working to build well-tested, community-developed, interoperable tools is how we solve increasingly difficult research questions ranging from human health to plant breeding to microbial community function. The progress made in two days illuminates the effectiveness of open collaborative science.

Tool Development

BioRuby and BaseSpace – Develop SDK and apps for Illumina BaseSpace

Toshiaki Katayama, Raoul Bonnal, Eri Kibukawa, Joachim Baran, Dan MacLean, Fernando Izquierdo-Carrasco, Spencer Bliven

During the Codefest, we tested and documented our port of the BaseSpace Python SDK to Ruby. Ruby/Biogem developers can now easily utilize next-generation sequencing code within the Illumina’s BaseSpace framework. For non-Ruby programers, we found that it can be a burden to create new Web app from scratch on top of your NGS program. So we started new project to provide a Web-app scaffold for BaseSpace. We have already implemented the basic portion but will need some more time before releasing the BioBaseSpace application. The BaseSpace Ruby SDK was officially released: for more information, see Joachim’s blog post, the official announcement from the BioRuby team and the annoucement from Illumina.

Barrnap – Bacterial ribosomal RNA predictor

Torsten Seemann, Tim Booth

For the last 8 years RNAmmer has been the standard tool for predicting ribosomal RNA features in genomes, because it is reasonably fast, accurate, and works on bacteria and eukaryotes. Its drawbacks are that it relies on small, older databases; requires an older conflicting version of HMMER; and has restrictive licence terms. To resolve these issues we have implemented a new rRNA predictor which uses the new “nmmer” tool from HMMER 3.1 for searching DNA profiles against DNA sequence. We used the Silva and GreenGenes seed alignments for the 5S, 23S and 16S genes to build the profile models from. Barrnap is a small Perl script which takes FASTA as input, and outputs the rRNA feature predictions in GFF3 format. It will be packaged in Bio-Linux and replace RNAmmer in the Prokka bacterial annotation system.

BioJVM – Coordinating and integrating BioJava and ScaBio

Spencer Bliven, Andreas Prlic, Markus Gumbel

Both Java and Scala run on the Java Virtual Machine. As such, it makes sense to coordinate and document the various Bio* projects which run on the JVM and therefor can interoperate to some degree. We are able to successfully reference BioJava functions from Scala code and ScaBio functions from Java code. The ease of this process means that users can easily use both libraries from whichever language is more suited for their biological problem.


Peter Cock, Konstantin Tretyakov, Bin Zhang

The Biopython team worked on training new users at Codefest and exploring integration of Biopython with other Python molecular visualization toolkits like PyMol. Infrastructure development involved testing and debugging on multiple systems, including identifying and fixing Windows and PyPy problems. We also identified areas where we can make it easier to contribute to Biopython: specifically easing the process to report and fix bugs by moving to integrated GitHub issue tracking and working to support Biopython-associated projects with easy installation tools.

Galaxy Debianization

Tim Booth

I spent several hours revisiting previous work on the Galaxy package for Bio-Linux and made significant progress towards it being something that can go into Debian-proper. Results will be committed to Deb-Med public SVN and patches will be forwarded to the Galaxy dev mailing list.

Standards and Visualization

Ontology and provenance representation

Herve Menager, Bertrand Neron, Jackie Quinn, Stian Soiland-Reyes, Matus Kalas, Steffen Moller

The goal of this group was to investigate and implement solutions to use ontologies to help people find and use the programs and data they need for their work, and to help automate the integration of tools or data resources into workflows or workbenches. We also wanted to identify useful provenance metadata, to store in a rigorous way the conditions and configuration of analysis steps run by users. This improves transparency, reproducibility, and reliability of the scientific results.

We worked toward inclusion of the EDAM onotology as part of the Mobyle system’s built-in type and classification mechanisms. We created a user case by identify workflows in Mobyle and mapped the descriptions unto EDAM classification to allow mapping between the types. We also investigated the possibilities opened by projects such as PROV to standardize the provenance information stored by systems such as Mobyle. We added a prototype functionality to the development version of Mobyle that dynamically generates this provenance information in a JSON-based format.

Integrate DGE-Vis & Dalliance, JS animation scheduler

David Powell, Thomas Down, Skyler Brungardt, Alex Kalderimis

We worked on integrating two visualization tools: the Dalliance genome browser and the DGE-Vis RNA-seq explorer. We now have a proof-of-concept tool that makes it possible to visualise RNA-seq analysis while browsing the genome. This inspired a JavaScript scheduler that is able to schedule slow animation updates when the JavaScript engine is not busy, allowing smoother animations and more accurate windows. Finally, we added a JBrowse-compatible JSON backend for Dalliance for integration with Intermine.


Infrastructure management via CloudBioLinux (CBL)

Enis Afgan, John Chilton, Brad Chapman

  • Galaxy: We integrated custom installation procedures present in CBL with the Galaxy-tools versioned installation methodology.
  • Documentation: Due to the increased interest by individuals to use and contribute to CBL, we invested effort into creating purpose-driven documentation for CBL. This should help people use the endproduct of CBL, customize CBL their needs, as well as learn about the internals of CBL with the aim of contributing. We will finish and make the documentation available on ReadTheDocs over the coming months.
  • Build frameworks: We developed a simpler automated method to invoke the CBL build framework to help remove complex error prone steps.
  • Web tooling: In spirit of making CBL more accessible and easier to use, we’ve decided to tackle development of a lightweight webapp that helps with customizing and generating CBL configuration files.

Improve ipython cluster support and runtime metrics

Valentine Svensson, Guillermo Carrasco, Roman Valls, Per Unneberg

We worked to extend the Ipython parallel cluster framework to support additional schedulers, specifically implementing SLURM support to supplement existing SGE, LSF, Torque and Condor schedulers. We plan to extend this to allow generalized use of the DRMAA connector, ultimately port such generalization into ipython so that python scientific computations can be executed efficiently across different clusters implementation. Both Roman and Guillermo blogged detailed documentation of the work in progress.

We also worked to build a tool that helps provide run time estimations for bioinformatcs jobs (e.g. “how long should aligning 40 million reads against hg19 with BWA take if I use 8 cores?”). We plan to collaborate on longer term development of this with the Genome Comparison of Analytic Testing team.

GATK-based reusable pipeline based around Rubra/Ruffus

Clare Sloggett, Bernie Pope

We worked on code cleanup, documentation and test data for a reusable pipeline to handle variant calling and annotation, using Rubra built on the Ruffus framework. It handles BWA alignment, GATK alignment cleaning and variant calling and ENSEMBL annotation. To make these pipelines easier to run, we worked on integrating them into the GVF flavor in CloudBioLinux.

Written by Brad Chapman

July 18, 2013 at 6:26 pm

Posted in OpenBio

Tagged with , ,

Bioinformatics open source interoperability Hackathon at the Broad Institute

leave a comment »

Interoperability Hackathon

On April 7th and 8th, a group of biologists and programmers gathered at the Broad Institute to work on improving interoperability of open-source bioinformatics tools. Organized by the Open Bioinformatics Foundation and GenomeSpace team, this was part of the lead up to the Bioinformatics Open Source Conference (BOSC) in July in Berlin. The event is part of an ongoing series of coding sessions (Codefests or Hackathons) organized by the open bioinformatics community, which give programmers who typically work together remotely a chance to code and discuss in the same place for two days. These have been successful in both producing new code and in building connections which help sustain development of these community projects.

Goals and outcomes

One major challenge in analyzing biological data is interfacing multiple bioinformatics tools. Tools often work independently, and where general architectures like plugins or API exist they are often project specific. This results in isolated islands of data exchange, but transferring data or resources between tools requires work that is often rate-limiting or insurmountable.

Our goal at the hackathon was to provide simple APIs and implementations that help facilitate transfers between multiple islands of functionality. GenomeSpace does this by providing a central hub and API to push and pull from tools. We wanted to generalize this to support multiple tools, and build client implementations that demonstrate this in practice. The long term goal is to encourage tool developers to provide server side APIs compatible with the more general library, making extension of the connector toolkit easier. For developers, the client API would allow them to easily transfer files between multiple tools without needing to learn and implement the specific transfer APIs of each tool.

We called this high level client library Genome Connector (gcon, for short) and took a practical approach by implementing client libraries that provide a common interface to multiple tools: GenomeSpace, Galaxy, BaseSpace, 23andMe and general key-value stores through jClouds. To identify a reasonable amount of work for two days, we focused on file transfer: authentication, finding files, getting and putting files to remote analysis platforms. In addition we defined some critical components for doing biological work:

  • File metadata: We need to be able to store arbitrary key/value on objects to assign essential biological information necessary to interpret it, like organisms and genome build. In addition, metadata allows provenance and tracking of files by enabling annotation of files with history and processing steps.
  • Filesets: Large biological files have secondary files with indexes, allowing indexed retrieval of data (for example: read bam and bai, variant vcf and idx, tabix gz and tbi). To avoid expensive reindexing, we want to group and transfer these together.

We also identified other useful extensions that would help improve interoperability and facilitate building connected tools, like providing Publish/subscribe hooks to avoid having to poll servers for updates, and smarter approaches to sending data to avoid duplication and unnecessary transfer of data.

The output of our discussion and coding are common Genome Connector implementations in multiple languages. GitHub repositories are available for in-progress Java, Python and Clojure implementations. These wrap multiple diverse tools and expose them through a common top level API, allowing developers to push and pull data from multiple tools.

I’m immensely grateful to the incredible participants who generously donated their time and expertise to help with these projects. For anyone interested we also have detailed documentation on discussions during the hackathon.

Bioinformatics Open Source Conference

If you’re a bioinformatics programmers interested in open source coding and helping answer biological questions by improving usability and connectivity of tools, you’re welcome to join the OpenBio and BOSC communities. We’ve created a biological interoperability mailing list for additional discussion. The next BOSC conference is July 19th and 20th in Berlin, Germany as part of the ISMB conference. There will also be another two day Codefest proceeding BOSC on July 17th and 18th. Abstracts for talks at BOSC are due this Friday, April 12th. Looking forward to seeing everyone at future BOSC and coding events.

Written by Brad Chapman

April 8, 2013 at 3:25 pm

Posted in OpenBio

Tagged with , , ,

Parallel approaches in next-generation sequencing analysis pipelines

with 2 comments

My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.

Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single Hi-Seq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.

The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.

Process overview — points for parallel implementations

The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files:

Initial lane processing

The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling:

Sample processing overview

The variant calling steps utilize The Genome Analysis Toolkit (GATK) from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls:

GATK variant calling details

Subsequent steps include assessment of variant effects using snpEff and haplotype phasing of variants in diploid organism analyses.

Messaging approach to parallel execution

The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface start this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes:

Messaging approach

In my previous post, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.

The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are be a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the bcbio-nextgen GitHub repository

Written by Brad Chapman

September 10, 2011 at 3:12 pm

Next generation sequencing information management and analysis system for Galaxy

with 34 comments

Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses.

Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.

This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.

Front-end usage

Researcher sample entry

Biologists use a local Galaxy server as an entry point to submit samples for sequencing. This provides a familiar interface and central location for both entering sample information and retrieving and analyzing the sequencing data.

Practically, a user begins by browsing to the sample submission page. There they are presented with a wizard interface which guides them through entry of sample details. Multiplexed samples are supported through a drag and drop interface.

When all samples are entered, the user submits them as a sequencing project. This includes billing information and a project name to facilitate communication between the researcher and the core group about submissions. Users are able to view their submissions grouped as projects and track the state of constructs. Since we support a number of services in addition to sequencing — like library construction, quantitation and validation — this is a valuable way for users to track and organize their requests.

Sequencing tracking and management

Administrators and sequencing technicians have access to additional functionality to help manage the internal sample preparation and sequencing workflow. The main sample tracking interface centers around a set of queues; each queue represents a state that a sample can be in. Samples move through the queues as they are processed, with additional information being added to the sample at each step. For instance, a sample in the ‘Pre-sequencing quantitation’ queue moves to the ‘Sequencing’ queue once it has been fully quantitated, with that quantitation information entered by the sequencing technician during the transition.

Assigning samples to flow cells occurs using a drag and drop jQueryUI interface. The design is flexible to allow for placing samples across multiple lanes or multiplexing multiple barcoded samples into a single lane.

Viewing sequencing results

Running a sequencing machine requires careful monitoring of results and our interface provides several ways to view this data. Raw cluster and read counts are linked to a list of runs. For higher level analyses, interactive plots are available for viewing reads over time and pass rates compared to read density. These allow adjustment of experimental procedures to maximize useful reads based on current machine chemistry.

Analysis pipeline

Utilizing a front end that organizes requests allows sequencing results to be processed through a fully automated analysis pipeline on the back end. The pipeline detects runs coming off of a sequencer, transfers files to storage and analysis machines and manages a number of processing steps:

In addition to the default analysis, a full SNP calling pipeline is included with:

Fastq reads, alignments files, summary PDFs and other associated files are uploaded back into into Galaxy Data Libraries organized by sample names. Users can download results for offline work, or import them directly into their Galaxy history for further analysis or display.

Uploaded analysis files in data library

Installing and extending

The code base is maintained as a Bitbucket repository that tracks the main galaxy-central distribution. It is updated from the main site regularly to maintain compatibility, with the future goal of integrating a generalized version into the main source tree. Detailed installation instructions are available for setting up the front-end client.

The analysis pipeline is written in Python and drives a number of open source programs; it is available as a GitHub repository with documentation and installation instructions.

We are using the current system in production and continue to develop and add features based on user feedback. We would like to generalize this for other research cores with additional instruments and services, and would be happy to hear from developers working on this type of system for their facilities.

Implementation details

This work would not have been possible without the great open source toolkits and frameworks that it builds on. Galaxy provides not only an analysis framework, but also a ready to use database structure for managing samples and requests. The front end builds off existing Galaxy sample tracking work, and requires only two new database storage tables.

The main change from the existing sample tracking framework is a generalization of the sample and request relationships. Requests can both contain samples, and be a part of samples so that a sequenced sample is organized as:

Request sample database architecture

By reusing and expanding the great work of the Galaxy team, we hope to eventually integrate useful parts of this work into the Galaxy codebase.

Written by Brad Chapman

January 11, 2011 at 10:05 am

Posted in OpenBio

Tagged with , ,

CloudBioLinux: progress on bioinformatics cloud images and data

with 4 comments

My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with:

New software and data

The most exciting changes have been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom configured servers. A great group of contributors have put together a large set of programs and libraries; the configuration files have all the details on installed programs as well as libraries for Python, Perl, Ruby, and R. Another addition is support for non-packaged programs which provides software not yet neatly wrapped in a package manger or library-specific install system: next-gen software packages like Picard, GATK and Bowtie are installed through custom scripts.

To improve accessibility for developers who prefer a desktop experience, a FreeNX server was integrated with the provided images. Tim Booth from the NEBC Bio-Linux team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.

In addition to the software image, a publicly available data volume is now available that contains:

  • Genome sequences pre-indexed for search with next-gen aligners like Bowtie, Novoalign, and BWA.
  • LiftOver files for mapping between sequence coordinates.
  • UniRef protein databases, indexed for searching with BLAST+.

Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of Galaxy, the data volume plugs directly into Galaxy’s ready to use cloud image. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.

The data volume preparation is fully automated via a fabric install script, similar to the software install script. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.

Documentation and presentations

The software and data volumes are only as good as the documentation which helps people use them:

Community: Codefest 2010

The CloudBioLinux community had a chance to work together for two days in July at Codefest 2010. In conjunction with the Bioinformatics Open Source Conference (BOSC) in Boston, this was a free to attend coding session hosted at Harvard School of Public Health and Massachusetts General Hospital. Over 30 developers donated two days of their time to working on CloudBioLinux and other bioinformatics open source projects.

Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.

It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We’re planning another session next year in Vienna, again just before BOSC and ISMB 2011; and again, everyone is welcome.


Go to the CloudBioLinux website for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With Amazon’s new micro-images you can start analyzing data for only a few cents an hour. It’s an easy way to explore if cloud resources will help with computational demands in your work. We’re very interested in feedback and happy to have other developers helping out; please get in touch on the CloudBioLinux mailing list.

Written by Brad Chapman

October 13, 2010 at 6:19 pm

Posted in OpenBio

Tagged with , , ,

Automated build environment for Bioinformatics cloud images

with 15 comments

Amazon web services provide scalable, on demand computational resources through their elastic compute cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I’m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) containing packages integrated from several existing efforts. The hope is to consolidate the community’s open source work around a single, continuously improving, machine image.

This image incorporates software from several existing AMIs:

  • JCVI Cloud BioLinux — JCVI’s work porting Bio-Linux to the cloud.
  • bioperl-max — Fortinbras’ package of BioPerl and associated informatics tools.
  • MachetEC2 — An InfoChimps image loaded with data mining software.

Each of these libraries inspired different aspects of developing this image and associated infrastructure, and I’m extremely grateful to the authors for their code, documentation and discussions.

The current AMI is available for loading on EC2 — search for ‘CloudBioLinux’ in the AWS console or go to the CloudBioLinux project page for the latest AMIs. Automated scripts and configuration files with contained packages are available as a GitHub repository.

Contributions encouraged

This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up to date libraries and programs they have installed on their work machines. If their favorite package is missing it should be quick and easy to add, making the improvement available to future developers.

Achieving these goals requires help and contributions from other programmers utilizing the cloud — everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I’d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:

  • Programmers: help expand the configuration files for your areas of interest:
    • Perl CPAN support and libraries
    • Ruby gems
    • Java libraries
    • Haskell hackage support and libraries
    • Erlang libraries
    • Bioinformatics areas of specialization:
      • Next-gen sequencing
      • Structural biology
      • Parallelized algorithms
    • Much more… Let us know what you are interested in.
  • Documentation experts: provide cookbook style instructions to help others get started.
  • Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.
  • Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We’d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
  • Testers: Check that this runs on open source Eucalyptus clouds, additional linux distributions, and other cloud deployments.

If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.

Infrastructure overview

In addition to supplying an image for downstream use, this implementation was designed to be easily extendible. Inspired by the MachetEC2 project, packages to be installed are entered into a set of easy to edit configuration files in YAML syntax. There are three different configuration file types:

  • main.yaml — The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.
  • packages.yaml — Defines debian/ubuntu packages to be installed. This leans heavily on the work of DebianMed and Bio-Linux communities, as well as all of the hard working package maintainers for the distributions. If it exists in package form, you can list it here.
  • python-libs.yaml, r-libs.yaml — These take advantage of language specific ways of installing libraries. Currently implemented is support for Python library installation from the Python package index, and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.

The Fabric remote automated deployment tool is used to build AMIs from these configuration files. Written in Python, the fabfile automates the process of installing packages on the cloud machine.

We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of program and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding in new functionality.

If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it’ll be two enjoyable days of cloud informatics development. I’m looking forward to hearing from other developers who are interested in building and maintaining an easy to use, up to date, machine image that can help make biological computation more accessible to the community.

Written by Brad Chapman

May 8, 2010 at 9:35 am

Posted in OpenBio

Tagged with , , ,

Biopython projects for Google Summer of Code 2010

with 5 comments

Google Summer of Code provides the unique opportunity for students to spend a summer working on open source projects and getting paid. Biopython was involved with two great projects last summer, and it’s time to apply for this year’s program: the student application period is from next Monday, March 29th to Friday, April 9th, 2010.

If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:

  • NESCent Phyloinformatics — NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.
  • Open Bioinformatics Foundation — The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.

This year, I’ve collaborated on three project ideas centering around the idea of tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts out the implementation details, instead allowing you to focus on the high level biological questions. Bradford Cross, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in an interview at Data Wrangling.

These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:

All involve taking two or more different toolkits and combining the functionality into a higher level interface focused around ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on functionality most relevant to their biological questions. Ideally the work would be both a publicly available resource, and contribute directly to the student’s daily research.

If you’re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It’s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.

Written by Brad Chapman

March 26, 2010 at 8:02 am

Posted in OpenBio

Tagged with , ,

Trimming adaptors from short read sequences

with 11 comments

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    if right_side:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Tagged with , , ,

Sorting genomic alignments using Python

leave a comment »

Recent, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file and the researchers were interested in manually examining the longest matches. The tricky part of doing this is that the entire file cannot comfortably fit in memory on a standard machine.

The common solution to dealing with large memory intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these type of real life problems without resorting to buying large memory machines.

We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.

The first step is to define the index retrieval method. bx-python has index retrival functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can retrieve records based on position in the file as well. We build the index by looping through each record and recording the file position and genomic locations in the index.

from bx.align import maf
from bx import interval_index_file

def build_index(in_file, index_file):
    indexes = interval_index_file.Indexes()
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec =
            if rec is None:
            for c in rec.components:
                indexes.add(c.src, c.forward_strand_start,
                        c.forward_strand_end, pos, max=c.src_size )

    with open(index_file, "w") as index_handle:

Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criteria to sort by. This list easily fits in memory, and we sort it with the largest alignments first.

rec_info = []
with open(in_file) as in_handle:
    reader = maf.Reader(in_handle)
    while 1:
        pos = reader.file.tell()
        rec =
        if rec is None:
        rec_info.append((rec.text_size, pos))

Finally, we loop though the sorted list of sizes and use the associated positions to get records from our index. Each record is then written to an output file.

index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
    writer = maf.Writer(out_handle)
    for size, pos in rec_info:
        rec = index.get_at_offset(pos)

The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.

Written by Brad Chapman

July 26, 2009 at 3:52 pm

Posted in OpenBio

Tagged with , ,

Thoughts from BOSC 2009; Python in bioinformatics

with 6 comments

I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This lead to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.

Python in Bioinformatics

One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.

This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.

What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each others platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curves to producing useful biology related code in Python.


My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first idea was taken very successfully by the Firefox browser with their plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.

A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.

The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons like backend, SQLAlchemy for databases interactivity, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.

We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.

I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.

Written by Brad Chapman

July 19, 2009 at 8:13 pm

Posted in OpenBio

Tagged with , ,