Posts Tagged ‘galaxy’
Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration
My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running CloudBioLinux and CloudMan during the AWS Genomics Event (talk slides are available).
The culmination of all this feedback is two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:
BioCloudCentral — A web service that launches CloudBioLinux and CloudMan clusters on Amazon Web Services hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.
A custom Galaxy integrated front-end to next-generation sequencing pipelines. A jQuery UI wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.
Special thanks are due to Enis Afgan for his help building these tools. He provided his boto expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.
This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind-the-scenes details described earlier are still available: the pipeline uses a CloudBioLinux image containing a wide variety of bioinformatics software, and you can use ssh or an NX graphical client to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.
BioCloudCentral — starting a CloudBioLinux instance
To get started, sign up for an Amazon Web Services account. This gives you access to on-demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the Amazon security credentials page.
With these, navigate to BioCloudCentral and fill out the simple entry form. In addition to your access credentials, enter your choice of a name used to identify the cluster, and your choice of password to access the CloudMan web interface and the cluster itself via ssh or NX.
Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.
Before leaving the monitoring page, download the pre-formatted user-data file; this allows you to later start the same CloudMan instance directly from the Amazon Web Services console.
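As a rough illustration of what a user-data file of this kind might hold, the sketch below writes the cluster name, password and AWS credentials as plain `key: value` lines. The key names and the layout are assumptions made for this example; the file downloaded from BioCloudCentral is the authoritative version.

```python
def build_user_data(cluster_name, password, access_key, secret_key):
    """Return user-data text as 'key: value' lines.

    The key names and layout are hypothetical, chosen for illustration only.
    """
    fields = [
        ("cluster_name", cluster_name),
        ("password", password),
        ("access_key", access_key),
        ("secret_key", secret_key),
    ]
    # Refuse to write a file with any empty entry.
    for key, value in fields:
        if not value:
            raise ValueError(f"missing value for {key}")
    return "\n".join(f"{key}: {value}" for key, value in fields) + "\n"


if __name__ == "__main__":
    # Placeholder values; never commit real credentials to a file you share.
    print(build_user_data("demo-cluster", "choose-a-password",
                          "YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY"))
```

Since the file holds your secret key and cluster password, treat it like any other credential and keep it out of shared directories.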
CloudMan — managing the cluster
The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:
This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.
Galaxy pipeline interface — running the analysis
This Galaxy instance is a fork of the main codebase containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Log in with the demonstration account (user: email@example.com; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:
The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional work node (Large instance type).
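To make the hand-off concrete, here is a minimal sketch of how a front end could collect wizard inputs into a configuration document before submitting it. The field names, the helper functions and the choice of JSON are all assumptions for illustration; the actual pipeline's configuration format may differ.

```python
import json


def build_run_config(fastq_files, target_bed, bait_bed, genome_build):
    """Collect wizard inputs into a plain dictionary (hypothetical field names)."""
    if not fastq_files:
        raise ValueError("at least one FASTQ file is required")
    return {
        "genome_build": genome_build,
        "inputs": {
            "fastq": list(fastq_files),
            "target_bed": target_bed,
            "bait_bed": bait_bed,
        },
    }


def serialize_config(config):
    """Serialize the configuration so it can be sent to a backend server."""
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    config = build_run_config(
        ["sample_1.fastq", "sample_2.fastq"], "targets.bed", "baits.bed", "hg19"
    )
    print(serialize_config(config))
```

Keeping the configuration as structured data rather than command-line flags means the same document can be logged, replayed or edited by hand when a run needs to be repeated.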
Galaxy — retrieving and displaying results
The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don’t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab-separated file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:
As a reminder, don’t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the Amazon console.
Analysis pipeline details and extending this work
The backend analysis pipeline is a freely available set of Python modules included on the CloudBioLinux AMI. The pipeline closely follows current best practice variant detection recommendations from the Broad GATK team:
- FASTQ alignment with BWA: source code
- Base quality score recalibration with GATK: source code
- Local realignment around indels with GATK: source code
- Variant calling (SNPs and indels) using the GATK Unified Genotyper: source code
- Variant effect estimation with snpEff: source code
- Read coverage visualization with wigToBigWig: source code
The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.
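One way to picture this kind of generality is a pipeline built from an ordered list of interchangeable steps. The sketch below is not the pipeline's actual code: every step is a placeholder that only records its name instead of invoking BWA, GATK or snpEff, and the function and stage names are invented for the example.

```python
def _mark(stage):
    """Return a placeholder step that records its stage name instead of running a tool."""
    def step(state):
        return {**state, "completed": state["completed"] + [stage]}
    return step


# Interchangeable alignment steps; adding an aligner means adding one entry here.
ALIGNERS = {
    "bwa": _mark("align:bwa"),
    "bowtie": _mark("align:bowtie"),
}


def build_steps(aligner="bwa"):
    """Assemble the ordered steps, selecting the aligner by name."""
    if aligner not in ALIGNERS:
        raise KeyError(f"unknown aligner: {aligner}")
    return [
        ALIGNERS[aligner],
        _mark("recalibrate"),
        _mark("realign"),
        _mark("call_variants"),
        _mark("predict_effects"),
        _mark("coverage"),
    ]


def run_pipeline(steps, sample):
    """Pass a state dictionary through each step in order."""
    state = {"sample": sample, "completed": []}
    for step in steps:
        state = step(state)
    return state


if __name__ == "__main__":
    print(run_pipeline(build_steps("bowtie"), "demo_sample")["completed"])
```

Because each step takes and returns the same kind of state, swapping the aligner or the variant caller does not touch the rest of the list.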
We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.
Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses.
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.
Researcher sample entry
Biologists use a local Galaxy server as an entry point to submit samples for sequencing. This provides a familiar interface and central location for both entering sample information and retrieving and analyzing the sequencing data.
Practically, a user begins by browsing to the sample submission page. There they are presented with a wizard interface which guides them through entry of sample details. Multiplexed samples are supported through a drag and drop interface.
When all samples are entered, the user submits them as a sequencing project. This includes billing information and a project name to facilitate communication between the researcher and the core group about submissions. Users are able to view their submissions grouped as projects and track the state of constructs. Since we support a number of services in addition to sequencing — like library construction, quantitation and validation — this is a valuable way for users to track and organize their requests.
Sequencing tracking and management
Administrators and sequencing technicians have access to additional functionality to help manage the internal sample preparation and sequencing workflow. The main sample tracking interface centers around a set of queues; each queue represents a state that a sample can be in. Samples move through the queues as they are processed, with additional information being added to the sample at each step. For instance, a sample in the ‘Pre-sequencing quantitation’ queue moves to the ‘Sequencing’ queue once it has been fully quantitated, with that quantitation information entered by the sequencing technician during the transition.
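The queue model can be sketched as a small state machine in which a sample only moves forward when the technician supplies the information for the current step. The class and function names below are illustrative, and the final "Complete" queue is an assumption added so the example has an end state; it is not taken from the production system.

```python
# The first two queue names come from the example above; "Complete" is hypothetical.
QUEUES = ["Pre-sequencing quantitation", "Sequencing", "Complete"]


class Sample:
    """A sample with its current queue and the details recorded at each step."""

    def __init__(self, name, queue=QUEUES[0]):
        self.name = name
        self.queue = queue
        self.details = {}


def advance(sample, **info):
    """Record information for the current queue, then move to the next one."""
    index = QUEUES.index(sample.queue)
    if index == len(QUEUES) - 1:
        raise ValueError(f"{sample.name} is already in the final queue")
    if not info:
        raise ValueError("each transition must record some information")
    sample.details[sample.queue] = info
    sample.queue = QUEUES[index + 1]
    return sample


if __name__ == "__main__":
    sample = Sample("library_42")
    advance(sample, concentration_nM=10.0)
    print(sample.queue, sample.details)
```

Requiring data at each transition is what turns the queues into data entry points: a sample cannot reach sequencing without its quantitation being on record.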
Assigning samples to flow cells occurs using a drag and drop jQueryUI interface. The design is flexible to allow for placing samples across multiple lanes or multiplexing multiple barcoded samples into a single lane.
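The two placement cases, one sample across several lanes and several barcoded samples in one lane, can be captured with a simple validation routine. This is a sketch under stated assumptions: the function name, the tuple layout and the default of eight lanes per flow cell are chosen for the example rather than taken from the interface code.

```python
from collections import defaultdict


def assign_lanes(assignments, n_lanes=8):
    """Group (sample, lane, barcode) tuples by lane and validate multiplexing.

    A sample may appear in several lanes. A lane may hold several samples
    only if every one of them has a distinct barcode.
    """
    lanes = defaultdict(list)
    for sample, lane, barcode in assignments:
        if not 1 <= lane <= n_lanes:
            raise ValueError(f"lane {lane} is outside 1..{n_lanes}")
        lanes[lane].append((sample, barcode))
    for lane, members in lanes.items():
        if len(members) > 1:
            barcodes = [barcode for _, barcode in members]
            if None in barcodes or len(set(barcodes)) != len(barcodes):
                raise ValueError(
                    f"lane {lane}: multiplexed samples need distinct barcodes"
                )
    return dict(lanes)


if __name__ == "__main__":
    layout = assign_lanes([
        ("sample_A", 1, None),      # one sample spread over two lanes
        ("sample_A", 2, None),
        ("sample_B", 3, "ACGT"),    # two barcoded samples sharing a lane
        ("sample_C", 3, "TTAG"),
    ])
    print(layout)
```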
Viewing sequencing results
Running a sequencing machine requires careful monitoring of results and our interface provides several ways to view this data. Raw cluster and read counts are linked to a list of runs. For higher level analyses, interactive plots are available for viewing reads over time and pass rates compared to read density. These allow adjustment of experimental procedures to maximize useful reads based on current machine chemistry.
Utilizing a front end that organizes requests allows sequencing results to be processed through a fully automated analysis pipeline on the back end. The pipeline detects runs coming off of a sequencer, transfers files to storage and analysis machines and manages a number of processing steps:
- Alignment with bowtie or bwa.
- Generation of alignment and read statistics with Picard, the fastx toolkit and SolexaQA.
- Preparation of a summary PDF with detailed statistics about the run and alignment.
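The run detection step can be approximated by polling a directory for run folders that contain a completion marker and have not been handled yet. The marker file name, the directory layout and the function name below are assumptions for illustration; the actual pipeline may decide that a run is finished in a different way.

```python
import tempfile
from pathlib import Path


def find_new_runs(dump_dir, processed, marker="RunComplete.txt"):
    """Return names of finished run directories that are not yet processed.

    A run counts as finished when it contains the marker file. The marker
    name is a placeholder, not a file any particular sequencer writes.
    """
    new_runs = []
    for run_dir in sorted(Path(dump_dir).iterdir()):
        finished = run_dir.is_dir() and (run_dir / marker).exists()
        if finished and run_dir.name not in processed:
            new_runs.append(run_dir.name)
    return new_runs


if __name__ == "__main__":
    # Build a throwaway directory with two finished runs and one in progress.
    with tempfile.TemporaryDirectory() as dump:
        for name, finished in [("run_001", True), ("run_002", False), ("run_003", True)]:
            run = Path(dump) / name
            run.mkdir()
            if finished:
                (run / "RunComplete.txt").touch()
        print(find_new_runs(dump, processed={"run_001"}))
```

Keeping the set of processed runs outside the function makes the check safe to repeat on a schedule without starting the same analysis twice.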
In addition to the default analysis, a full SNP calling pipeline is included with:
- GATK recalibration and realignment
- SNP identification with GATK’s unified genotyper
- Variant effect prediction with snpEff.
Fastq reads, alignment files, summary PDFs and other associated files are uploaded back into Galaxy Data Libraries organized by sample names. Users can download results for offline work, or import them directly into their Galaxy history for further analysis or display.
Installing and extending
The code base is maintained as a Bitbucket repository that tracks the main galaxy-central distribution. It is updated from the main site regularly to maintain compatibility, with the future goal of integrating a generalized version into the main source tree. Detailed installation instructions are available for setting up the front-end client.
The analysis pipeline is written in Python and drives a number of open source programs; it is available as a GitHub repository with documentation and installation instructions.
We are using the current system in production and continue to develop and add features based on user feedback. We would like to generalize this for other research cores with additional instruments and services, and would be happy to hear from developers working on this type of system for their facilities.
This work would not have been possible without the great open source toolkits and frameworks that it builds on. Galaxy provides not only an analysis framework, but also a ready to use database structure for managing samples and requests. The front end builds off existing Galaxy sample tracking work, and requires only two new database storage tables.
The main change from the existing sample tracking framework is a generalization of the sample and request relationships. Requests can both contain samples, and be a part of samples so that a sequenced sample is organized as:
By reusing and expanding the great work of the Galaxy team, we hope to eventually integrate useful parts of this work into the Galaxy codebase.
I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I talked briefly on publishing biological data on the web; you can check out the slides on Slideshare. This led to discussion with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were: organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.
Python in Bioinformatics
One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.
This distinctive organization is not a bad thing. It avoids creating an unwieldy package which can be hard to maintain and re-factor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.
What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each other's platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would help scientists by lowering the learning curve for producing useful biology related code in Python.
My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or if components themselves would eventually evolve towards a larger structure. The first approach was taken very successfully by the Firefox browser with its plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively smaller things that work, with an eye towards integration.
A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. The project has a powerful public site, which means using Galaxy requires no installation for many use cases.
We have installed Galaxy locally and are taking it for a spin for our data presentation tasks. The tool plugin interface works as described and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom display and more Python code on the backend and hopefully have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.
I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.