Posts Tagged ‘ec2’
My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.
Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single HiSeq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.
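To put rough numbers on that claim, here is a small back-of-the-envelope sketch. It assumes each core handles one sample at a time and that all samples take equally long, which is a simplification; the function names are made up for illustration.

```python
import math

def samples_per_run(flowcells=2, lanes_per_flowcell=8, barcodes_per_lane=12):
    """Number of barcoded samples produced by one sequencing run."""
    return flowcells * lanes_per_flowcell * barcodes_per_lane

def processing_rounds(n_samples, cores):
    """Rounds of work needed if each core processes one sample at a time."""
    return math.ceil(n_samples / cores)

total = samples_per_run()
print(total)                         # 192
print(processing_rounds(total, 8))   # 24 rounds on an 8-core server
print(processing_rounds(total, 64))  # 3 rounds on a 64-core cluster
```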
The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarrassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.
Process overview — points for parallel implementations
The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files:
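To make the barcode split concrete, here is a minimal sketch that groups reads by barcode. It assumes the barcode is the last six bases of each read and that reads are already parsed into (name, sequence, quality) tuples; both are assumptions for illustration, and the barcode table and function name are hypothetical rather than taken from the pipeline.

```python
from collections import defaultdict

def split_by_barcode(reads, barcodes, barcode_len=6):
    """Group (name, sequence, quality) reads by a trailing barcode.

    Assumes the barcode occupies the last `barcode_len` bases of the read.
    Matched reads have the barcode trimmed; others go to "unmatched".
    """
    by_sample = defaultdict(list)
    for name, seq, qual in reads:
        tag = seq[-barcode_len:]
        sample = barcodes.get(tag, "unmatched")
        if sample != "unmatched":
            seq, qual = seq[:-barcode_len], qual[:-barcode_len]
        by_sample[sample].append((name, seq, qual))
    return dict(by_sample)

# Hypothetical barcode table and reads, for illustration only.
barcodes = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
reads = [
    ("read1", "GGGGAAAAACGTAC", "IIIIIIIIIIIIII"),
    ("read2", "CCCCTTTTTGCATG", "IIIIIIIIIIIIII"),
    ("read3", "CCCCTTTTNNNNNN", "IIIIIIIIIIIIII"),
]
demultiplexed = split_by_barcode(reads, barcodes)
```

Each resulting group can then be aligned independently, which is what makes this step a natural point for parallelization.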
The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling:
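The bookkeeping behind that merge can be sketched as a simple grouping step: collect the per-lane BAM files that belong to the same sample, so each group can be merged and then processed on its own. The file names below are hypothetical, and the merge itself (the step list later on this page names Picard for preparing merged alignment files) is left out.

```python
from collections import defaultdict

def group_bams_by_sample(lane_bams):
    """Map each sample name to the sorted list of its per-lane BAM files."""
    by_sample = defaultdict(list)
    for sample, bam in lane_bams:
        by_sample[sample].append(bam)
    return {sample: sorted(bams) for sample, bams in by_sample.items()}

# Hypothetical (sample, BAM file) pairs from two lanes.
lane_bams = [
    ("Test replicate 1", "7_FC1_bc1-sort.bam"),
    ("Test replicate 2", "7_FC1_bc2-sort.bam"),
    ("Test replicate 1", "8_FC1_bc1-sort.bam"),
]
to_merge = group_bams_by_sample(lane_bams)
```

Samples with a single BAM file need no merge and can move straight on to the next step.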
The variant calling steps utilize The Genome Analysis Toolkit (GATK) from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls:
Messaging approach to parallel execution
The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface starts this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes:
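The controller and worker relationship can be illustrated without a message broker. The sketch below swaps RabbitMQ for Python's in-process `queue.Queue` and uses threads as stand-ins for processing nodes, so it only shows the pattern, not the pipeline's actual messaging code; the job names are invented.

```python
import queue
import threading

def worker(tasks, results):
    """Processing node: pull jobs until a None sentinel arrives."""
    while True:
        job = tasks.get()
        if job is None:
            break
        step, sample = job
        # A real node would run the alignment or variant calling here.
        results.put((sample, "%s finished" % step))

def run_controller(samples, n_workers=2):
    """Analysis controller: queue one job per sample and collect the replies."""
    tasks, results = queue.Queue(), queue.Queue()
    nodes = [threading.Thread(target=worker, args=(tasks, results))
             for _ in range(n_workers)]
    for node in nodes:
        node.start()
    for sample in samples:
        tasks.put(("align", sample))
    for _ in nodes:
        tasks.put(None)  # one stop signal per worker
    for node in nodes:
        node.join()
    return sorted(results.get() for _ in samples)

finished = run_controller(["lane1_bc1", "lane1_bc2", "lane2_bc1"])
```

Because the controller only talks to the queue, the same logic works whether the workers are cores on one machine or nodes on a cluster, which is the property the messaging approach is after.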
In my previous post, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.
The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the bcbio-nextgen GitHub repository.
A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.
Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:
CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.
CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.
Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput exome sequencing data:
- The underlying analysis software is from CloudBioLinux.
- CloudMan provides an SGE cluster managed via a web front end.
- RabbitMQ is used for communication between cluster nodes.
- An automated pipeline, written in Python, organizes parallel processing across the cluster.
Start cluster with CloudBioLinux and CloudMan
Start in the Amazon web console, a convenient front end for managing EC2 servers. The first step is to follow the CloudMan setup instructions to create an Amazon account and set up appropriate security groups and user data. The wiki page contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:
Once this is booted, proceed to the CloudMan web interface on the server and start up an instance from this shared identifier:
This screencast shows all of the details, including starting an additional node on the SGE cluster:
Configure AMQP messaging
Edit: The AMQP messaging steps have now been fully automated so the configuration steps in this section are no longer required. Skip down to the ‘Run Analysis’ section to start processing the data immediately.
With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:
$ ssh -i your-keypair email@example.com
Edit the /export/data/galaxy/universe_wsgi.ini configuration file to add the internal hostname. After editing, the AMQP section will look like:
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
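For reference, a section like this can be read from Python with the standard library's `configparser`. This is only a sketch of how a script might pick up the settings, not the pipeline's own loader; the function name is hypothetical, and interpolation is switched off on the assumption that a full Galaxy configuration file may contain literal percent signs.

```python
import configparser

# The AMQP section from above, inlined so the example is self-contained.
example_ini = """
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
"""

def read_amqp_settings(ini_text, section="galaxy_amqp"):
    """Return the AMQP settings as a dict, with the port as an integer."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    settings = dict(parser[section])
    settings["port"] = int(settings["port"])
    return settings

amqp = read_amqp_settings(example_ini)
print(amqp["host"], amqp["port"])
```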
Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:
$ sudo rabbitmqctl add_user biouser tester
creating user "biouser" ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost "bionextgen" ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser ".*" ".*" ".*"
setting permissions for user "biouser" in vhost "bionextgen" ...
...done.
With messaging in place, we are ready to run the analysis.
Run analysis
/export/data contains a ready to run example exome analysis, with FASTQ input files in /export/data/exome_example/fastq and configuration information in /export/data/exome_example/config. Start the fully automated pipeline with a single command:
$ cd /export/data/work
$ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml /export/data/exome_example/fastq /export/data/exome_example/config/run_info.yaml
distributed_nextgen_pipeline.py starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:
- Alignment with BWA
- Preparation of merged alignment files with Picard
- Recalibration and realignment with GATK
- Variant calling with GATK
- Assessment of predicted variant effects with snpEff
- Preparation of summary PDFs for each sample with read details from FastQC alongside alignment, hybrid selection and variant calling statistics from Picard
Monitor the running process
The example data is from a human chromosome 22 hybrid selection experiment. While running, you can keep track of the progress in several ways. SGE's qstat command will tell you where the analysis servers are running on the cluster:
$ qstat
job-ID  prior   name       user         state submit/start at     queue
----------------------------------------------------------------------------------
      1 0.55500 nextgen_an ubuntu       r     08/14/2011 18:16:32 firstname.lastname@example.org
      2 0.55500 nextgen_an ubuntu       r     08/14/2011 18:16:32 email@example.com
      3 0.55500 automated_ ubuntu       r     08/14/2011 18:16:47 firstname.lastname@example.org
Listing files in the working directory will show our progress:
$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp
The files that end with .o* are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:
$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane 8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:
$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml /export/data/exome_example/fastq /export/data/work /export/data/exome_example/config/run_info.yaml
For each sample, this script copies:
- A BAM file with aligned sequences and original FASTQ data
- A realigned and recalibrated BAM file, ready for variant calling
- Variant calls in VCF format.
- A tab delimited file of predicted variant effects.
- A PDF summary file containing alignment, variant calling and hybrid selection statistics.
into an output directory for the flowcell:
$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd
As suggested by the name, the script can also integrate the data into a Galaxy instance if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the UCSC browser.
All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through YAML configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.
The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.
If you’d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I’ll be discussing it at the AWS Genomics Event in Seattle on September 22nd.
One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores.
Amazon Simple Storage Service (S3) provides relatively inexpensive cloud storage with their reduced redundancy storage option. S3, and all of Amazon’s cloud services, are accessible directly from Python using boto. By using boto’s multipart upload support, coupled with Python’s built-in multiprocessing module, I’ll demonstrate maximizing transfer speeds to make uploading data less painful. The script is available from GitHub and requires the latest boto from GitHub (2.0b5 or better).
Parallel upload with multiprocessing
The overall process uses boto to connect to an S3 upload bucket, initialize a multipart transfer, split the file into multiple pieces, and then upload these pieces in parallel over multiple cores. Each processing core is passed a set of credentials to identify the transfer: the multipart upload identifier (mp.id), the S3 file key name (mp.key_name) and the S3 bucket name (mp.bucket_name):
import boto
conn = boto.connect_s3()
bucket = conn.lookup(bucket_name)
mp = bucket.initiate_multipart_upload(s3_key_name, reduced_redundancy=use_rr)
with multimap(cores) as pmap:
    for _ in pmap(transfer_part, ((mp.id, mp.key_name, mp.bucket_name, i, part)
                                  for (i, part) in
                                  enumerate(split_file(tarball, mb_size, cores)))):
        pass
mp.complete_upload()
The split_file function uses the unix split command to divide the file into sections, each of which will be uploaded separately.
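import glob
import os
import subprocess
def split_file(in_file, mb_size, split_num=5):
    # s3_key_name is not a parameter; it must be defined in the enclosing scope.
    prefix = os.path.join(os.path.dirname(in_file),
                          "%sS3PART" % (os.path.basename(s3_key_name)))
    # Aim for two parts per core, capped at 250MB per part.
    split_size = int(min(mb_size / (split_num * 2.0), 250))
    # Skip the split if the first part already exists from an earlier run.
    if not os.path.exists("%saa" % prefix):
        cl = ["split", "-b%sm" % split_size, in_file, prefix]
        subprocess.check_call(cl)
    return sorted(glob.glob("%s*" % prefix))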
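The part size arithmetic in split_file is easier to see with numbers plugged in. The sketch below reproduces only that calculation; the helper names are made up for illustration, and the error for very small files is an addition of this sketch, covering the case where the integer part size rounds down to zero.

```python
import math

def part_size_mb(file_mb, cores, max_part_mb=250):
    """Mirror the split_size arithmetic: two parts per core, capped at 250MB."""
    return int(min(file_mb / (cores * 2.0), max_part_mb))

def part_count(file_mb, cores):
    """Number of parts the file is divided into at that part size."""
    size = part_size_mb(file_mb, cores)
    if size < 1:
        # The integer part size rounds down to zero for very small files.
        raise ValueError("file too small to split into whole-megabyte parts")
    return math.ceil(file_mb / size)

print(part_size_mb(1000, 4), part_count(1000, 4))    # 125 8
print(part_size_mb(10000, 4), part_count(10000, 4))  # 250 40
```

With four cores, a 1GB file gives each core two 125MB parts, while a 10GB file hits the 250MB cap and is spread over 40 parts.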
The multiprocessing aspect is managed using a contextmanager. The initial multiprocessing pool is set up, using a specified number of cores, and configured to allow keyboard interrupts. We then return a lazy map function (imap) which can be used just like Python’s standard map. This transparently divides the function calls for each file part over all available cores. Finally, the pool is cleaned up when the map is finished running.
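import contextlib
import multiprocessing
from multiprocessing.pool import IMapIterator
@contextlib.contextmanager
def multimap(cores=None):
    """Provide a lazy multiprocessing map as a context manager."""
    if cores is None:
        cores = max(multiprocessing.cpu_count() - 1, 1)
    # Give IMapIterator.next a very long default timeout so the pool
    # responds to keyboard interrupts.
    def wrapper(func):
        def wrap(self, timeout=None):
            return func(self, timeout=timeout if timeout is not None else 1e100)
        return wrap
    IMapIterator.next = wrapper(IMapIterator.next)
    pool = multiprocessing.Pool(cores)
    yield pool.imap
    pool.terminate()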
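For readers on current Python 3, the same context manager pattern can be sketched as follows. Note the swap: this version uses a thread pool (`multiprocessing.pool.ThreadPool`) rather than worker processes, which keeps the example self-contained and suits I/O-bound work such as uploads, and it leaves out the keyboard interrupt workaround. The fake_transfer function is a hypothetical stand-in for the real upload.

```python
import contextlib
from multiprocessing.pool import ThreadPool

@contextlib.contextmanager
def multimap(workers=4):
    """Yield a lazy, order-preserving map backed by a pool of worker threads."""
    pool = ThreadPool(workers)
    try:
        yield pool.imap
    finally:
        # Always release the workers, even if the body raised an exception.
        pool.terminate()
        pool.join()

def fake_transfer(part):
    """Stand-in for uploading one file part; returns a label for what it 'sent'."""
    index, name = part
    return "%s:%s" % (index + 1, name)

parts = list(enumerate(["fileS3PARTaa", "fileS3PARTab", "fileS3PARTac"]))
with multimap(workers=2) as pmap:
    sent = list(pmap(fake_transfer, parts))
print(sent)
```

Wrapping the cleanup in try/finally means the pool is shut down even when one of the transfers fails partway through.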
The actual work of transferring each portion of the file is done using two functions. The helper function, mp_from_ids, uses the id information about the bucket, file key and multipart upload id to reconstitute a multipart upload object:
def mp_from_ids(mp_id, mp_keyname, mp_bucketname):
    """Rebuild a multipart upload object from its identifiers."""
    conn = boto.connect_s3()
    bucket = conn.lookup(mp_bucketname)
    mp = boto.s3.multipart.MultiPartUpload(bucket)
    mp.key_name = mp_keyname
    mp.id = mp_id
    return mp
This object, together with the number of the file part and the file itself, is used to transfer that section of the file. The file part is removed after successful upload.
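@map_wrap
def transfer_part(mp_id, mp_keyname, mp_bucketname, i, part):
    """Upload a single file part, then remove it from local disk."""
    mp = mp_from_ids(mp_id, mp_keyname, mp_bucketname)
    print " Transferring", i, part
    with open(part) as t_handle:
        # S3 part numbers start at 1, while enumerate starts at 0.
        mp.upload_part_from_file(t_handle, i+1)
    os.remove(part)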
When all sections, distributed over all processors, are finished, the multipart upload is signaled complete and Amazon finishes the process. Your file is now available on S3.
Download speeds can be maximized by utilizing several existing parallelized accelerators:
Combine these with the uploader to build up a cloud analysis workflow: move your data to S3, run a complex analysis pipeline on EC2, push the results back to S3, and then download them to local machines. Please share other tips and tricks you use to deal with Amazon file transfer in the comments.
My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with:
- A permanent web site at cloudbiolinux.org
- Additional software and genomic data
- New user documentation
- A community coding session: Codefest 2010
New software and data
The most exciting changes have been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom configured servers. A great group of contributors has put together a large set of programs and libraries; the configuration files have all the details on installed programs as well as libraries for Python, Perl, Ruby, and R. Another addition is support for non-packaged programs which provides software not yet neatly wrapped in a package manager or library-specific install system: next-gen software packages like Picard, GATK and Bowtie are installed through custom scripts.
To improve accessibility for developers who prefer a desktop experience, a FreeNX server was integrated with the provided images. Tim Booth from the NEBC Bio-Linux team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.
In addition to the software image, a publicly available data volume is now available that contains:
- Genome sequences pre-indexed for search with next-gen aligners like Bowtie, Novoalign, and BWA.
- LiftOver files for mapping between sequence coordinates.
- UniRef protein databases, indexed for searching with BLAST+.
Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of Galaxy, the data volume plugs directly into Galaxy’s ready to use cloud image. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.
The data volume preparation is fully automated via a fabric install script, similar to the software install script. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.
Documentation and presentations
The software and data volumes are only as good as the documentation which helps people use them:
- Bela Tiwari of the NEBC Bio-Linux team has written an excellent introduction to Amazon EC2 and CloudBioLinux. This breaks down the process of signing up for an account, creating a software image, associating data volumes and setting up a graphical server. It’s a great place to get started with CloudBioLinux.
- Ntino Krampis, from the JCVI Cloud Bio-Linux project, gave a presentation on CloudBioLinux explaining the motivation behind the project and providing usage examples.
- My presentation on the open source community behind CloudBioLinux from Amazon’s Genomic Data workshop. This details the project goals and automated code organization.
Community: Codefest 2010
The CloudBioLinux community had a chance to work together for two days in July at Codefest 2010. In conjunction with the Bioinformatics Open Source Conference (BOSC) in Boston, this was a free to attend coding session hosted at Harvard School of Public Health and Massachusetts General Hospital. Over 30 developers donated two days of their time to working on CloudBioLinux and other bioinformatics open source projects.
Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.
It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We’re planning another session next year in Vienna, again just before BOSC and ISMB 2011; and again, everyone is welcome.
Go to the CloudBioLinux website for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With Amazon’s new micro-images you can start analyzing data for only a few cents an hour. It’s an easy way to explore if cloud resources will help with computational demands in your work. We’re very interested in feedback and happy to have other developers helping out; please get in touch on the CloudBioLinux mailing list.
Amazon web services provide scalable, on demand computational resources through their elastic compute cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I’m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) containing packages integrated from several existing efforts. The hope is to consolidate the community’s open source work around a single, continuously improving, machine image.
This image incorporates software from several existing AMIs:
- JCVI Cloud BioLinux — JCVI’s work porting Bio-Linux to the cloud.
- bioperl-max — Fortinbras’ package of BioPerl and associated informatics tools.
- MachetEC2 — An InfoChimps image loaded with data mining software.
Each of these libraries inspired different aspects of developing this image and associated infrastructure, and I’m extremely grateful to the authors for their code, documentation and discussions.
The current AMI is available for loading on EC2 — search for ‘CloudBioLinux’ in the AWS console or go to the CloudBioLinux project page for the latest AMIs. Automated scripts and configuration files with contained packages are available as a GitHub repository.
This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up to date libraries and programs they have installed on their work machines. If their favorite package is missing it should be quick and easy to add, making the improvement available to future developers.
Achieving these goals requires help and contributions from other programmers utilizing the cloud — everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I’d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:
- Programmers: help expand the configuration files for your areas of interest:
- Perl CPAN support and libraries
- Ruby gems
- Java libraries
- Haskell hackage support and libraries
- Erlang libraries
- Bioinformatics areas of specialization:
- Next-gen sequencing
- Structural biology
- Parallelized algorithms
- Much more… Let us know what you are interested in.
- Documentation experts: provide cookbook style instructions to help others get started.
- Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.
- Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We’d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
- Testers: Check that this runs on open source Eucalyptus clouds, additional linux distributions, and other cloud deployments.
If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.
In addition to supplying an image for downstream use, this implementation was designed to be easily extensible. Inspired by the MachetEC2 project, packages to be installed are entered into a set of easy to edit configuration files in YAML syntax. There are three different configuration file types:
- main.yaml — The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.
- packages.yaml — Defines debian/ubuntu packages to be installed. This leans heavily on the work of DebianMed and Bio-Linux communities, as well as all of the hard working package maintainers for the distributions. If it exists in package form, you can list it here.
- python-libs.yaml, r-libs.yaml — These take advantage of language specific ways of installing libraries. Currently implemented is support for Python library installation from the Python package index, and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.
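The way these files fit together can be sketched in a few lines: the top level file names the groups to install, and the package file maps each group to its packages. In the sketch below plain dictionaries stand in for the parsed YAML, and the key names, group names and package lists are all hypothetical, so treat it as an illustration of the idea rather than the actual file layout.

```python
def packages_to_install(main_config, package_groups):
    """Expand the selected groups into an ordered package list without duplicates."""
    selected = []
    for group in main_config["packages"]:
        for name in package_groups.get(group, []):
            if name not in selected:
                selected.append(name)
    return selected

# Stand-in for main.yaml: the hypothetical "r" group has been commented out.
main_config = {"packages": ["python", "bio_nextgen"]}

# Stand-in for packages.yaml: hypothetical groups and package names.
package_groups = {
    "python": ["python-numpy", "python-scipy"],
    "r": ["r-base"],
    "bio_nextgen": ["bowtie", "bwa", "samtools"],
}

install_list = packages_to_install(main_config, package_groups)
print(install_list)
```

Dropping a group from the top level list removes all of its packages from the build, which is what makes a custom image a matter of commenting out a few lines.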
We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding in new functionality.
If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it’ll be two enjoyable days of cloud informatics development. I’m looking forward to hearing from other developers who are interested in building and maintaining an easy to use, up to date, machine image that can help make biological computation more accessible to the community.
Amazon Web Services provide an excellent distributed computing infrastructure through their Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and associated resources. Essentially, they make available on demand compute power and storage at prices that scale with usage. In the past I’ve written about using EC2 for parallel parsing of large files. Generally, I am a big proponent of distributed computing as a solution to dealing with problems ranging from job scaling to improving code availability.
One of the challenges in advocating for using EC2 at my day to day work is the presence of existing computing resources. We have servers and clusters, but how will we scale for future work? Thankfully, we are able to assess the utility of Amazon services for future scaling through their education and research grants. Our group applied and was accepted for a research grant which we plan to use to develop and distribute next generation sequencing analyses both within our group at Mass General Hospital and in the larger community.
Amazon Machine Images (AMIs) provide an opportunity for the open source bioinformatics community to increase code availability. AMIs are essentially pre-built operating systems with installed programs. By creating AMIs and making them available, a programmer can make their code readily accessible to users and avoid any of the intricacies of installation and configuration. Add this to available data in the form of public data sets and you have a ready to go analysis platform with very little overhead. There is already a large set of available AMIs from which to build.
This idea and our thoughts on moving portions of our next generation sequencing analysis to EC2 are fleshed out further in our research grant application, portions of which are included below. We’d love to collaborate with others moving their bioinformatics work to Amazon resources.
One broad area of rapid growth in biology research is deep sequencing (or short read) technology. A single lab investigator can produce hundreds of millions of DNA sequences, equivalent in scale to the entire human genome, in a period of days. This DNA sequencing technology is widely available through both on-site facilities as well as through commercial services. Creating scalable analysis methods is a high priority for the entire bioinformatics community; see http://selab.janelia.org/people/eddys/blog/?p=123 for a presentation nicely summarizing the issues. We propose to address the computational bottlenecks resulting from this huge data volume using distributed AWS resources.
An additional aim of our work is to provide tools to biologists looking to solve their data analysis challenges. When the computational portion of a project becomes a time limiting step, we can often speed up the cycling between experiment and analysis by providing researchers with ready to run scripts or web interfaces. However, this is complicated by high usage on shared computational resources and heterogeneous platforms requiring time consuming configuration. Both problems could be ameliorated by scalable EC2 instances with custom configured machine images.
The goals of this grant application are to develop our analysis platform on Amazon’s compute cloud and assess transfer, storage and utilization costs. We currently have internal computational resources ranging from high performance clusters to large memory machines. We believe Amazon’s compute cloud to be an ideal solution as our analysis needs outgrow our current hardware.
Benefits to Amazon and the community
Developing software on AWS architecture presents a move towards a standard platform for bioinformatics research. Our group is invested in the open source community and shares both code and analysis tools. One common hindrance to sharing is the heterogeneity of platforms; code is developed on a local cluster and not readily generalizable, hence it is not shared.
By building public machine images along with reusable source code, a diverse variety of users can readily use our code and tools. As short read sequencing continues to increase in utility and popularity, a practical ready-to-go platform for analyses will encourage many users to adopt parallelization on cloud resources as a research approach. We have begun initial work with this paradigm by developing parsers for large annotation files using MapReduce on EC2.
Having the ability to utilize AWS with your support will help us further develop and disseminate analysis templates for the larger biology community, enabling science both at MGH and elsewhere.
The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years, and giving a short and sweet 10 minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with The Data and Analysis Management special interest group, also full of interesting talks.
The talk will promote development of reusable web based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:
- Motivation: rapidly organize and display biological data in a web accessible format.
- Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
- Current state: back end databases for storing biological data — BioSQL, GMOD
- Current state: full featured web applications targeted at users — Galaxy, GBrowse
- My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with larger community.
- Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
- Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
- Framework: make use of back end web frameworks — Pylons
- Implementation: Demo server for displaying sequences plus annotations
- Implementation: Utilizes BioSQL schema, ported to object oriented data store; Google App engine backend or MongoDB backend
- Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
- Implementation: Additional screenshots from internal web displays.
- Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
- Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.
I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.