<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Blue Collar Bioinformatics</title>
	<atom:link href="http://bcbio.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://bcbio.wordpress.com</link>
	<description></description>
	<lastBuildDate>Fri, 30 Dec 2011 10:34:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='bcbio.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Blue Collar Bioinformatics</title>
		<link>http://bcbio.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://bcbio.wordpress.com/osd.xml" title="Blue Collar Bioinformatics" />
	<atom:link rel='hub' href='http://bcbio.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Making next-generation sequencing analysis pipelines easier with BioCloudCentral and Galaxy integration</title>
		<link>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/</link>
		<comments>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/#comments</comments>
		<pubDate>Wed, 30 Nov 2011 01:50:52 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[galaxy]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=251</guid>
		<description><![CDATA[My previous post described running an automated exome pipeline using CloudBioLinux and CloudMan, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running CloudBioLinux and CloudMan during the AWS Genomics [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">previous post described running an automated exome pipeline using CloudBioLinux and CloudMan</a>, and generated incredibly useful feedback. Comments and e-mails pointed out potential points of confusion for new users deploying the process on custom data. I also had the chance to get hands on with researchers running <a href="http://cloudbiolinux.org/">CloudBioLinux</a> and <a href="http://www.usecloudman.org/">CloudMan</a> during the <a href="http://aws.amazon.com/genomicsevent/">AWS Genomics Event</a> (<a href="http://www.slideshare.net/chapmanb/developing-distributed-analysis-pipelines-with-shared-community-resources-using-cloudbiolinux-and-cloudman">talk slides are available</a>).</p>
<p>All of this feedback culminated in two new development projects from the CloudBioLinux community, aimed at making it easier to run custom analysis pipelines:</p>
<ul>
<li>
<p><a href="http://biocloudcentral.org">BioCloudCentral</a> &#8212; A web service that launches CloudBioLinux and CloudMan clusters on <a href="http://aws.amazon.com/">Amazon Web Services</a> hardware. This removes all of the manual steps involved in setting up security groups and launching a CloudBioLinux instance. A user only needs to sign up for an AWS account; BioCloudCentral takes care of everything else.</p>
</li>
<li>
<p>A custom <a href="http://galaxyproject.org/">Galaxy</a> integrated front-end to next-generation sequencing pipelines. A <a href="http://jqueryui.com/">jQuery UI</a> wizard interface manages the intake of sequences and specification of parameters. It runs an automated backend processing pipeline with the structured input data, uploading results into Galaxy data libraries for additional analysis.</p>
</li>
</ul>
<p>Special thanks are due to <a href="http://www.usecloudman.org/enis/">Enis Afgan</a> for his help building these tools. He provided his <a href="http://readthedocs.org/docs/boto/en/latest/">boto</a> expertise to the BioCloudCentral Amazon interaction, and generalized CloudMan to support the additional flexibility and automation on display here.</p>
<p>This post describes using these tools to start a CloudMan instance, create an SGE cluster and run a distributed variant calling analysis, all from the browser. The behind-the-scenes details described earlier are still available: the pipeline uses a CloudBioLinux image containing a wide variety of bioinformatics software and you can use ssh or an <a href="http://www.nomachine.com/download.php">NX graphical client</a> to connect to the instance. This is the unique approach behind CloudBioLinux and CloudMan: they provide an open framework for building automated, easy-to-use workflows.</p>
<h2 id="biocloudcentral----starting-a-cloudbiolinux-instance">BioCloudCentral &#8212; starting a CloudBioLinux instance</h2>
<p>To get started, sign up for an <a href="http://aws.amazon.com/">Amazon Web Services account</a>. This gives you access to on-demand computing where you pay per hour of usage. Once signed up, you will need your Access Key ID and Secret Access Key from the <a href="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key">Amazon security credentials page</a>.</p>
<p>With these, navigate to <a href="http://biocloudcentral.org">BioCloudCentral</a> and fill out the simple entry form. In addition to your access credentials, enter your choice of a name used to identify the cluster, and your choice of password to access the CloudMan web interface and the cluster itself via ssh or NX.</p>
<p><span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/"><img src="http://img.youtube.com/vi/9ZqvxhwcPaw/2.jpg" alt="" /></a></span><br />
</p>
<p>Clicking submit launches a CloudBioLinux server on Amazon. Be careful, since you are now paying per hour for your machine; remember to shut it down when finished.</p>
<p>Before leaving the monitoring page, download the pre-formatted user-data file; it lets you later start the same CloudMan instance directly from the <a href="https://console.aws.amazon.com/ec2/home">Amazon Web Services console</a>.</p>
<h2 id="cloudman----managing-the-cluster">CloudMan &#8212; managing the cluster</h2>
<p>The monitoring page on BioCloudCentral provides links directly to the CloudMan web interface. On the welcome page, start a shared CloudMan instance with this identifier:</p>
<pre><code>cm-b53c6f1223f966914df347687f6fc818/shared/2011-11-29--01-44
</code></pre>
<p></p>
<p>This shared instance contains the custom Galaxy interface we will use, along with FASTQ sequence files for demonstration purposes. CloudMan will start up the filesystem, SGE, PostgreSQL and Galaxy. Once launched, you can use the CloudMan interface to add additional machines to your cluster for processing.</p>
<p><span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/"><img src="http://img.youtube.com/vi/NkayXBBAr8I/2.jpg" alt="" /></a></span><br />
</p>
<h2 id="galaxy-pipeline-interface----running-the-analysis">Galaxy pipeline interface &#8212; running the analysis</h2>
<p>This Galaxy instance is a <a href="https://bitbucket.org/hbc/galaxy-central-hbc/overview">fork of the main codebase</a> containing a custom pipeline interface in addition to all of the standard Galaxy tools. It provides an intuitive way to select FASTQ files for processing. Log in with the demonstration account (user: example@example.com; password: example) and load FASTQ files along with target and bait BED files into your active history. Then work through the pipeline wizard step by step to start an analysis:</p>
<p><span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/"><img src="http://img.youtube.com/vi/dIeQCIi3EXw/2.jpg" alt="" /></a></span><br />
</p>
<p>The Galaxy interface builds a configuration file describing the parameters and inputs, and submits this to the backend analysis server. This server kicks off processing, distributing the analysis across the SGE cluster. For the test data, processing will take approximately 4 hours on a cluster with a single additional work node (Large instance type).</p>
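<p>To make the hand-off concrete, here is a minimal sketch of collecting wizard selections into one configuration document that a separate backend process could read. The field names, the <code>hg19</code> default and the use of JSON are assumptions for illustration only; the real interface writes its own configuration format.</p>

```python
import json

def build_run_config(fastq_files, target_bed, bait_bed, genome_build="hg19"):
    """Collect wizard selections into one serializable description of a run.

    All key names here are hypothetical, chosen only for this sketch.
    """
    if not fastq_files:
        raise ValueError("at least one FASTQ file is required")
    return {
        "genome_build": genome_build,
        "fastq": sorted(fastq_files),
        "hybrid_target": target_bed,
        "hybrid_bait": bait_bed,
    }

config = build_run_config(["s1_2.fastq", "s1_1.fastq"], "targets.bed", "baits.bed")
# Serialize so a separate backend process can read the same description.
payload = json.dumps(config, indent=2, sort_keys=True)
print(payload)
```

<p>Keeping the whole run description in a single document means the web front end and the analysis server only need to agree on that one file.</p>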
<h2 id="galaxy----retrieving-and-displaying-results">Galaxy &#8212; retrieving and displaying results</h2>
<p>The analysis pipeline uploads the finalized results into Galaxy data libraries. For this demonstration, the example user has results from a previous run in the data library so you don&#8217;t need to wait for the analysis to finish. This folder contains alignment data in BAM format, coverage information in BigWig format, a VCF file of variant calls, a tab-separated file with predicted variant effects, and a PDF file of summary information. After importing these into your active Galaxy history, you can perform additional analysis on the data, including visualization in the UCSC genome browser:</p>
<p><span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/"><img src="http://img.youtube.com/vi/LtOK3U8990w/2.jpg" alt="" /></a></span><br />
</p>
<p>As a reminder, don&#8217;t forget to terminate your cluster when finished. You can do this either from the CloudMan web interface or the <a href="https://console.aws.amazon.com/ec2/home">Amazon console</a>.</p>
<h2 id="analysis-pipeline-details-and-extending-this-work">Analysis pipeline details and extending this work</h2>
<p>The backend analysis pipeline is a <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">freely available set of Python modules</a> included on the CloudBioLinux AMI. The pipeline closely follows current <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3">best practice variant detection recommendations from the Broad GATK team</a>:</p>
<ul>
<li>FASTQ alignment with <a href="http://bio-bwa.sourceforge.net/">BWA</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/ngsalign/bwa.py">source code</a></li>
<li><a href="http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration">Base quality score recalibration</a> with GATK: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/recalibrate.py">source code</a></li>
<li><a href="http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels">Local realignment around indels</a> with GATK: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/realign.py">source code</a></li>
<li>Variant calling (SNPs and indels) using the <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper">GATK Unified Genotyper</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/genotype.py">source code</a></li>
<li>Variant effect estimation with <a href="http://snpeff.sourceforge.net/">snpEff</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/variation/effects.py">source code</a></li>
<li>Read coverage visualization with <a href="http://hgdownload.cse.ucsc.edu/admin/exe/">wigToBigWig</a>: <a href="https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/bam_to_wiggle.py">source code</a></li>
</ul>
<p>The pipeline framework design is general, allowing incorporation of alternative aligners or variant calling algorithms.</p>
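<p>One way a pluggable design like this can look is a small registry that maps a name from the configuration to the function implementing that step. This is a sketch under my own assumptions, not the actual bcbio code: the <code>register_aligner</code> decorator, the <code>aligner</code> configuration key and the placeholder return values are all hypothetical.</p>

```python
ALIGNERS = {}

def register_aligner(name):
    """Decorator that records an alignment function under a short name."""
    def wrapper(func):
        ALIGNERS[name] = func
        return func
    return wrapper

@register_aligner("bwa")
def align_bwa(fastq, ref):
    # Placeholder: a real step would run the aligner and return a BAM path.
    return "bwa:{0}:{1}".format(fastq, ref)

@register_aligner("other")
def align_other(fastq, ref):
    # Placeholder standing in for any alternative aligner.
    return "other:{0}:{1}".format(fastq, ref)

def run_alignment(config, fastq, ref):
    """Pick the aligner named in the configuration, defaulting to bwa."""
    name = config.get("aligner", "bwa")
    if name not in ALIGNERS:
        raise KeyError("unknown aligner: " + name)
    return ALIGNERS[name](fastq, ref)

print(run_alignment({"aligner": "other"}, "reads.fastq", "hg19.fa"))
```

<p>Adding a new aligner then only requires writing one function and registering it; the code that drives the pipeline stays unchanged.</p>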
<p>We hope that in addition to being directly useful, this framework can fit within the work environments of other developers. The flexible toolkit used is: CloudBioLinux with open source bioinformatics libraries, CloudMan with a managed SGE cluster, Galaxy with a custom pipeline interface, and finally Python to parallelize and manage the processing. We invite you to fork and extend any of the different components. Thank you again to everyone for the amazing feedback on the analysis pipeline and CloudBioLinux.</p>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Parallel approaches in next-generation sequencing analysis pipelines</title>
		<link>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/</link>
		<comments>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/#comments</comments>
		<pubDate>Sat, 10 Sep 2011 19:12:50 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[distributed-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=229</guid>
		<description><![CDATA[My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I&#8217;ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing. Incredible innovation in throughput [...]]]></description>
			<content:encoded><![CDATA[<p>My last post described a <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">distributed exome analysis pipeline</a> implemented on the <a href="http://cloudbiolinux.org">CloudBioLinux</a> and <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan</a> frameworks. This was a practical introduction to running the pipeline on <a href="http://aws.amazon.com/">Amazon resources</a>. Here I&#8217;ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.</p>
<p>Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single <a href="http://www.illumina.com/systems/hiseq_2000.ilmn">Hi-Seq</a> run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.</p>
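<p>The arithmetic is worth spelling out. The sketch below computes the sample count from the numbers above and then, under the simplifying assumption that each core handles one sample at a time, how many sequential rounds a step needs for a few hypothetical core counts.</p>

```python
import math

flowcells, lanes_per_flowcell, barcodes_per_lane = 2, 8, 12
samples = flowcells * lanes_per_flowcell * barcodes_per_lane

def batches_needed(n_samples, cores):
    """Number of sequential rounds if each core handles one sample at a time."""
    return math.ceil(n_samples / cores)

print(samples)
# Core counts below are illustrative, not measurements from a real cluster.
for cores in (8, 64, 192):
    print(cores, batches_needed(samples, cores))
```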
<p>The heterogeneity of architectures utilized by researchers is a major challenge in building re-usable systems. A pipeline needs to support <a href="http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html">powerful multi-core servers</a>, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the <a href="http://en.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a> nature of the computation. An <a href="http://www.rabbitmq.com/">AMQP messaging queue</a> allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.</p>
<div id="process-overview----points-for-parallel-implementations">
<h2>Process overview &#8212; points for parallel implementations</h2>
<p>The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files:</p>
<p><a href="http://chapmanb.github.com/bcbb/lane_processing.png"> <img src="http://chapmanb.github.com/bcbb/lane_processing.png" width="650px" alt="Initial lane processing" /></a></p>
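<p>As a rough illustration of the barcode splitting step, the sketch below groups reads by an exact match on the last few bases of each sequence. The barcode position (the end of the read), the exact-match rule and the sample names are assumptions for this example; the real pipeline's demultiplexing may differ on all three.</p>

```python
from collections import defaultdict

def split_by_barcode(reads, barcodes, barcode_len):
    """Group (name, sequence) reads by the barcode at the end of each sequence.

    Reads whose trailing bases match no known barcode go to "unmatched".
    """
    known = {seq: name for name, seq in barcodes.items()}
    groups = defaultdict(list)
    for name, seq in reads:
        tag = seq[-barcode_len:]
        sample = known.get(tag, "unmatched")
        # Trim the barcode off matched reads so it is not aligned.
        trimmed = seq[:-barcode_len] if sample != "unmatched" else seq
        groups[sample].append((name, trimmed))
    return dict(groups)

reads = [("r1", "ACGTTTAGGC"), ("r2", "GGGTTTCCAT"), ("r3", "ACGTACGTAC")]
barcodes = {"sampleA": "AGGC", "sampleB": "CCAT"}
print(split_by_barcode(reads, barcodes, 4))
```

<p>Each resulting group is independent of the others, which is what lets alignment and BAM processing run in parallel per barcode.</p>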
<p>The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling:</p>
<p><a href="http://chapmanb.github.com/bcbb/sample_processing.png"> <img src="http://chapmanb.github.com/bcbb/sample_processing.png" width="650px" alt="Sample processing overview" /></a></p>
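<p>The bookkeeping behind the merge is a simple grouping: collect the per-lane BAM files that belong to the same sample. Here is a minimal sketch, assuming lane results arrive as <code>(lane, sample, bam_path)</code> tuples; that tuple layout and the file names are hypothetical, and the actual merging of BAM files is left to a tool like Picard.</p>

```python
from collections import defaultdict

def group_bams_by_sample(lane_results):
    """Map each sample name to the sorted list of per-lane BAM files for it."""
    by_sample = defaultdict(list)
    for lane, sample, bam in lane_results:
        by_sample[sample].append(bam)
    return {sample: sorted(bams) for sample, bams in by_sample.items()}

lane_results = [
    (1, "NA12878", "1_NA12878-sort.bam"),
    (2, "NA12878", "2_NA12878-sort.bam"),
    (2, "control", "2_control-sort.bam"),
]
for sample, bams in sorted(group_bams_by_sample(lane_results).items()):
    # Each entry is one merge job, and the jobs are independent of each other.
    print(sample, bams)
```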
<p>The variant calling steps utilize <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">The Genome Analysis Toolkit (GATK)</a> from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls:</p>
<p><a href="http://chapmanb.github.com/bcbb/variant_call.png"> <img src="http://chapmanb.github.com/bcbb/variant_call.png" width="650px" alt="GATK variant calling details" /></a></p>
<p>Subsequent steps include <a href="http://snpeff.sourceforge.net/">assessment of variant effects using snpEff</a> and <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Read-backed_phasing_algorithm">haplotype phasing of variants</a> in diploid organism analyses.</p>
</div>
<div id="messaging-approach-to-parallel-execution">
<h2>Messaging approach to parallel execution</h2>
<p>The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface starts this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes:</p>
<p><a href="http://chapmanb.github.com/bcbb/parallel_messaging.png"> <img src="http://chapmanb.github.com/bcbb/parallel_messaging.png" width="650px" alt="Messaging approach" /></a></p>
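<p>The controller and worker pattern can be sketched without a message broker. The example below uses Python's in-process <code>queue</code> and <code>threading</code> modules as a stand-in for RabbitMQ, so it shows only the shape of the communication, not the AMQP protocol or the pipeline's actual code. The function names and the task format are assumptions.</p>

```python
import queue
import threading

def worker(tasks, results):
    """Processing node: pull work until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        step, sample = item
        # Placeholder for real work such as alignment or variant calling.
        results.put((sample, step, "done"))

def run_controller(samples, step, n_workers=2):
    """Analysis controller: publish one task per sample, then collect replies."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for sample in samples:
        tasks.put((step, sample))
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    for t in threads:
        t.join()
    return sorted(results.get() for _ in samples)

print(run_controller(["s1", "s2", "s3"], "align"))
```

<p>Because the controller only talks to queues, the workers could just as well live on other machines; that separation is what a broker like RabbitMQ provides across a cluster.</p>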
<p>In <a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/">my previous post</a>, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.</p>
<p>The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">bcbio-nextgen GitHub repository</a>.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/09/10/parallel-approaches-in-next-generation-sequencing-analysis-pipelines/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/lane_processing.png" medium="image">
			<media:title type="html">Initial lane processing</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/sample_processing.png" medium="image">
			<media:title type="html">Sample processing overview</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/variant_call.png" medium="image">
			<media:title type="html">GATK variant calling details</media:title>
		</media:content>

		<media:content url="http://chapmanb.github.com/bcbb/parallel_messaging.png" medium="image">
			<media:title type="html">Messaging approach</media:title>
		</media:content>
	</item>
		<item>
		<title>Distributed exome analysis pipeline with CloudBioLinux and CloudMan</title>
		<link>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/</link>
		<comments>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/#comments</comments>
		<pubDate>Fri, 19 Aug 2011 21:33:16 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[distributed-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=217</guid>
		<description><![CDATA[A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide [...]]]></description>
			<content:encoded><![CDATA[<p>A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code to chain individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.</p>
<p>Establishing community resources that meet the challenges of building these pipelines ensures that bioinformatics programmers can share the burden of building large scale systems. Two open-source efforts which aim at providing this type of architecture are:</p>
<ul>
<li>
<p><a href="http://cloudbiolinux.org">CloudBioLinux</a> &#8212; A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.</p>
</li>
<li>
<p><a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan</a> &#8212; Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by <a href="http://userwww.service.emory.edu/~eafgan/">Enis Afgan</a> and the <a href="http://wiki.g2.bx.psu.edu/">Galaxy Team</a>, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on <a href="http://aws.amazon.com/">Amazon AWS</a>.</p>
</li>
</ul>
<p>Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high throughput <a href="http://en.wikipedia.org/wiki/Exome_Sequencing">exome sequencing</a> data:</p>
<ul>
<li>The underlying analysis software is from CloudBioLinux.</li>
<li>CloudMan provides an SGE cluster managed via a web front end.</li>
<li><a href="http://www.rabbitmq.com/">RabbitMQ</a> is used for communication between cluster nodes.</li>
<li><a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">An automated pipeline</a>, written in <a href="http://python.org">Python</a>, organizes parallel processing across the cluster.</li>
</ul>
<p>Below are instructions for starting a cluster on <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> resources to run an exome sequencing pipeline that processes <a href="http://en.wikipedia.org/wiki/FASTQ_format">FASTQ</a> sequencing reads, producing fully annotated <a href="http://en.wikipedia.org/wiki/Variant_Call_Format">variant calls</a>.</p>
<div id="start-cluster-with-cloudbiolinux-and-cloudman">
<h2>Start cluster with CloudBioLinux and CloudMan</h2>
<p>Start in the <a href="https://console.aws.amazon.com/ec2/">Amazon web console</a>, a convenient front end for managing EC2 servers. The first step is to follow the <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">CloudMan setup instructions</a> to create an Amazon account and set up appropriate security groups and user data. The <a href="http://wiki.g2.bx.psu.edu/Admin/Cloud">wiki page</a> contains detailed screencasts. Below is a short screencast showing how to boot your CloudBioLinux specific CloudMan server:</p>
<span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/"><img src="http://img.youtube.com/vi/eS8vmKIXIB4/2.jpg" alt="" /></a></span>
<p></p>
<p>Once this is booted, proceed to the CloudMan web interface on the server and start up an instance from this shared identifier:</p>
<pre><code>cm-b53c6f1223f966914df347687f6fc818/shared/2011-10-07--14-00
</code></pre>
<p></p>
<p>This screencast shows all of the details, including starting an additional node on the SGE cluster:</p>
<p><span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/"><img src="http://img.youtube.com/vi/4kIRI1m0g7Y/2.jpg" alt="" /></a></span><br />
</p>
</div>
<div id="configure-amqp-messaging">
<h2>Configure AMQP messaging</h2>
<p><b>Edit:</b> The AMQP messaging steps have now been fully automated so the configuration steps in this section are no longer required. Skip down to the &#8216;Run Analysis&#8217; section to start processing the data immediately.
</p>
<p>With your server booted and ready to run, the next step is to configure RabbitMQ messaging to communicate between nodes on your cluster. In the AWS console, find the external and internal hostname of the head machine. Start by opening an ssh connection to the machine with the external hostname:</p>
<pre><code>$ ssh -i your-keypair ubuntu@ec2-50-19-177-134.compute-1.amazonaws.com
</code></pre>
<p></p>
<p>Edit the <code>/export/data/galaxy/universe_wsgi.ini</code> configuration file to add the internal hostname. After editing, the AMQP section will look like:</p>
<pre><code>[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
</code></pre>
<p></p>
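<p>If you do edit this section by hand, a quick sanity check is to parse it back and confirm the values. This is a minimal sketch using Python's standard <code>configparser</code> on the example text above; the <code>read_amqp_settings</code> helper is hypothetical and not part of the pipeline.</p>

```python
import configparser

EXAMPLE = """
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
"""

def read_amqp_settings(text):
    """Return the galaxy_amqp section as a dict with the port as an integer."""
    # Interpolation is disabled so literal % characters in values are kept.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["galaxy_amqp"]
    return {
        "host": section["host"],
        "port": section.getint("port"),
        "userid": section["userid"],
        "password": section["password"],
    }

settings = read_amqp_settings(EXAMPLE)
print(settings["host"], settings["port"])
```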
<p>Finally, add the user and virtual host to the running RabbitMQ server on the master node with 3 commands:</p>
<pre><code>$ sudo rabbitmqctl add_user biouser tester
creating user &quot;biouser&quot; ...
...done.
$ sudo rabbitmqctl add_vhost bionextgen
creating vhost &quot;bionextgen&quot; ...
...done.
$ sudo rabbitmqctl set_permissions -p bionextgen biouser &quot;.*&quot; &quot;.*&quot; &quot;.*&quot;
setting permissions for user &quot;biouser&quot; in vhost &quot;bionextgen&quot; ...
...done.
</code></pre>
<p></p>
</div>
<div id="run-analysis">
<h2>Run analysis</h2>
<p>With messaging in place, we are ready to run the analysis. <code>/export/data</code> contains a ready to run example exome analysis, with <a href="http://en.wikipedia.org/wiki/FASTQ_format">FASTQ</a> input files in <code>/export/data/exome_example/fastq</code> and configuration information in <code>/export/data/exome_example/config</code>. Start the fully automated pipeline with a single command:</p>
<pre><code> $ cd /export/data/work
 $ distributed_nextgen_pipeline.py /export/data/galaxy/post_process.yaml
                                   /export/data/exome_example/fastq
                                   /export/data/exome_example/config/run_info.yaml
</code></pre>
<p></p>
<p><code>distributed_nextgen_pipeline.py</code> starts processing servers on each of the cluster nodes, using SGE for scheduling. Then a top level analysis server runs, splitting the FASTQ data across the nodes at each step of the process:</p>
<ul>
<li>Alignment with <a href="http://bio-bwa.sourceforge.net/">BWA</a></li>
<li>Preparation of merged alignment files with <a href="http://picard.sourceforge.net/">Picard</a></li>
<li>Recalibration and realignment with <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">GATK</a></li>
<li>Variant calling with GATK</li>
<li>Assessment of predicted variant effects with <a href="http://snpeff.sourceforge.net/">snpEff</a></li>
<li>Preparation of summary PDFs for each sample with read details from <a href="http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/">FastQC</a> alongside alignment, hybrid selection and variant calling statistics from Picard</li>
</ul>
</div>
<div id="monitor-the-running-process">
<h2>Monitor the running process</h2>
<p>The example data is from a human chromosome 22 hybrid selection experiment. While it runs, you can track progress in several ways. SGE&#8217;s <code>qstat</code> command will tell you where the analysis servers are running on the cluster:</p>
<pre><code>$ qstat
job-ID  prior   name   user  state submit/start at   queue
----------------------------------------------------------------------------------
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47 all.q@ip-10-125-10-182.ec2.int
</code></pre>
<p></p>
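<p>If you want to summarize this listing from a script, the rows can be split on whitespace. The sketch below is an illustration only: it works on the sample output pasted in as a string, the function name <code>parse_qstat</code> is invented, and it assumes the first eight columns are laid out as in the example above.</p>

```python
from collections import defaultdict

QSTAT_OUTPUT = """\
job-ID  prior   name   user  state submit/start at   queue
----------------------------------------------------------------------------------
1 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an ubuntu  r  08/14/2011 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ ubuntu  r  08/14/2011 18:16:47 all.q@ip-10-125-10-182.ec2.int
"""


def parse_qstat(text):
    """Turn qstat job rows into dictionaries, skipping header and rule lines."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job id and have at least 8 columns.
        if len(fields) >= 8 and fields[0].isdigit():
            job_id, _prior, name, user, state, date, time, queue = fields[:8]
            jobs.append({
                "id": int(job_id),
                "name": name,
                "user": user,
                "state": state,
                "started": date + " " + time,
                # The queue column looks like all.q@hostname.
                "host": queue.split("@", 1)[-1],
            })
    return jobs


jobs = parse_qstat(QSTAT_OUTPUT)
by_host = defaultdict(list)
for job in jobs:
    by_host[job["host"]].append(job["name"])
for host in sorted(by_host):
    print(host, by_host[host])
```

<p>In this example the head node runs one analysis server plus the top level <code>automated_</code> job, and the second node runs the other analysis server.</p>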
<p>Listing files in the working directory will show our progress:</p>
<pre><code>$ cd /export/data/work
$ ls -lh
drwxr-xr-x 2 ubuntu ubuntu 4.0K 2011-08-13 21:09 alignments
-rw-r--r-- 1 ubuntu ubuntu 2.0K 2011-08-13 21:17 automated_initial_analysis.py.o11
drwxr-xr-x 2 ubuntu ubuntu   33 2011-08-13 20:43 log
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o10
-rw-r--r-- 1 ubuntu ubuntu  15K 2011-08-13 21:17 nextgen_analysis_server.py.o9
drwxr-xr-x 8 ubuntu ubuntu  102 2011-08-13 21:06 tmp
</code></pre>
<p></p>
<p>The files that end with <code>.o*</code> are log files from each of the analysis servers and provide detailed information about the current state of processing at each server:</p>
<pre><code>$ less nextgen_analysis_server.py.o10
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
</code></pre>
<p></p>
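<p>To see at a glance which step a server has reached, you can pull out the last message from its log. This is a rough sketch working on the excerpt above; <code>log_messages</code> is a made-up helper, and it assumes that a line without the <code>INFO: nextgen_pipeline:</code> prefix is a wrapped continuation of the previous message.</p>

```python
LOG_PREFIX = "INFO: nextgen_pipeline: "

LOG_EXCERPT = """\
INFO: nextgen_pipeline: Processing sample: Test replicate 2; lane
  8; reference genome hg19; researcher ; analysis method SNP calling
INFO: nextgen_pipeline: Aligning lane 8_100326_FC6107FAAXX with bwa aligner
INFO: nextgen_pipeline: Combining and preparing wig file [u'', u'Test replicate 2']
INFO: nextgen_pipeline: Recalibrating [u'', u'Test replicate 2'] with GATK
"""


def log_messages(text):
    """Return the pipeline messages in order, rejoining wrapped lines."""
    messages = []
    for line in text.splitlines():
        if line.startswith(LOG_PREFIX):
            messages.append(line[len(LOG_PREFIX):].strip())
        elif messages and line.strip():
            # Assumed to be the wrapped tail of the previous message.
            messages[-1] += " " + line.strip()
    return messages


messages = log_messages(LOG_EXCERPT)
print("Current step:", messages[-1])
```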
</div>
<div id="retrieve-results">
<h2>Retrieve results</h2>
<p>The processing pipeline results in numerous intermediate files. These take up a lot of disk space and are not necessary after processing is finished. The final step in the process is to extract the useful files for visualization and further analysis:</p>
<pre><code>$ upload_to_galaxy.py /export/data/galaxy/post_process.yaml
                      /export/data/exome_example/fastq
                      /export/data/work
                      /export/data/exome_example/config/run_info.yaml
</code></pre>
<p></p>
<p>For each sample, this script copies:</p>
<ul>
<li>A <a href="http://samtools.sourceforge.net/SAM1.pdf">BAM file</a> with aligned sequences and original FASTQ data</li>
<li>A realigned and recalibrated BAM file, ready for variant calling</li>
<li>Variant calls in <a href="http://en.wikipedia.org/wiki/Variant_Call_Format">VCF format</a>.</li>
<li>A tab delimited file of predicted variant effects.</li>
<li>A PDF summary file containing alignment, variant calling and hybrid selection statistics.</li>
</ul>
<p>into an output directory for the flowcell: <code>/export/data/galaxy/storage/100326_FC6107FAAXX</code>:</p>
<pre><code>$ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
-rw-r--r-- 1 ubuntu ubuntu  38M 2011-08-19 20:50 7_100326_FC6107FAAXX.bam
-rw-r--r-- 1 ubuntu ubuntu  22M 2011-08-19 20:50 7_100326_FC6107FAAXX-coverage.bigwig
-rw-r--r-- 1 ubuntu ubuntu  72M 2011-08-19 20:51 7_100326_FC6107FAAXX-gatkrecal.bam
-rw-r--r-- 1 ubuntu ubuntu 109K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-effects.tsv
-rw-r--r-- 1 ubuntu ubuntu 827K 2011-08-19 20:51 7_100326_FC6107FAAXX-snp-filter.vcf
-rw-r--r-- 1 ubuntu ubuntu 1.6M 2011-08-19 20:50 7_100326_FC6107FAAXX-summary.pd
</code></pre>
<p></p>
<p>As suggested by the name, the script can also integrate the data into a <a href="http://usegalaxy.org">Galaxy instance</a> if desired. This allows biologists to perform further data analysis, including visual inspection of the alignments in the <a href="http://genome.ucsc.edu/">UCSC browser</a>.</p>
</div>
<div id="learn-more">
<h2>Learn more</h2>
<p>All components of the pipeline are open source and part of community projects. CloudMan, CloudBioLinux and the pipeline are customized through <a href="http://en.wikipedia.org/wiki/YAML">YAML</a> configuration files. Combined with the CloudMan managed SGE cluster, the pipeline can be applied in parallel to any number of samples.</p>
<p>The overall goal is to share the automated infrastructure work that moves samples from sequencing to being ready for analysis. This allows biologists more rapid access to the processed data, focusing attention on the real work: answering scientific questions.</p>
<p>If you&#8217;d like to hear more about CloudBioLinux, CloudMan and the exome sequencing pipeline, I&#8217;ll be discussing it at the <a href="http://aws.amazon.com/genomicsevent/">AWS Genomics Event</a> in Seattle on September 22nd.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog</title>
		<link>http://bcbio.wordpress.com/2011/07/04/summarizing-next-gen-sequencing-variation-statistics-with-hadoop-using-cascalog/</link>
		<comments>http://bcbio.wordpress.com/2011/07/04/summarizing-next-gen-sequencing-variation-statistics-with-hadoop-using-cascalog/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 01:25:14 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cascalog]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=199</guid>
		<description><![CDATA[Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=199&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. <a href="http://hadoop.apache.org/">Hadoop</a> offers an open source framework for batch processing large files. This post describes using <a href="https://github.com/nathanmarz/cascalog">Cascalog</a>, a Hadoop query language written in <a href="http://clojure.org/">Clojure</a>, to investigate quality statistics for variant calling in deeply sequenced regions.</p>
<div id="biological-question">
<h3>Biological question</h3>
<p>The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of majority and minority variants. We are filtering variants on 3 metrics of quality:</p>
<ul>
<li>Quality score &#8212; The sequencing technology&#8217;s assessment of the correctness of a base.</li>
<li>K-mer score &#8212; An estimate of the uniqueness of the region surrounding the base call position, built using <a href="https://github.com/ctb/khmer">khmer</a>. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.</li>
<li>Mapping score &#8212; The aligner&#8217;s estimate of the reliability of the read alignment.</li>
</ul>
<p>Each read and position is in a <a href="https://github.com/hbc/projects/blob/master/snp-assess/test/data/raw_variations.tsv">tab delimited file</a> that looks like:</p>
<pre><code>951     G       31      0.0515130211584 198</code></pre>
<p>
<p>The <a href="https://github.com/hbc/projects/blob/master/snp-assess/test/positions/pos_to_examine.tsv">training data</a> has a set of known variable positions, and details about how the current variant calling algorithm did at each position:</p>
<pre><code>951     T       false_positive  0.7
953     A       true_positive   4.0
</code></pre>
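<p>For readers more at home in Python, here is a small sketch of reading the two kinds of rows exactly as they are printed above. The field names are my own labels, the meaning of the last training column is not spelled out in this post so it is kept as a plain number, and the real files may carry additional leading columns (the Clojure parser later in this post reads one more field per read).</p>

```python
from collections import namedtuple

# Field names are illustrative labels, not taken from the project code.
ReadCall = namedtuple("ReadCall", "pos base qual kmer_pct map_score")
Target = namedtuple("Target", "pos base call_type value")


def parse_read_call(line):
    """Parse one row of per-read data as shown in the example."""
    pos, base, qual, kmer_pct, map_score = line.split()
    return ReadCall(int(pos), base, int(qual), float(kmer_pct), int(map_score))


def parse_target(line):
    """Parse one row of the training positions as shown in the example."""
    pos, base, call_type, value = line.split()
    return Target(int(pos), base, call_type, float(value))


read = parse_read_call("951\tG\t31\t0.0515130211584\t198")
target = parse_target("951\tT\tfalse_positive\t0.7")
print(read)
print(target)
```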
<p>
<p>We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.</p>
</div>
<div id="writing-cascalog-queries">
<h3>Writing cascalog queries</h3>
<p>Cascalog is based on the <a href="http://en.wikipedia.org/wiki/Datalog">Datalog</a> rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them. This requires a change of mindset from the more standard approach of writing detailed instructions about what to do.</p>
<p>Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. <a href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html">This post from Antonio Piccolboni</a> compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.</p>
<p>The main query defines the outputs, retrieves input data from the <code>snpdata</code> and <code>targets</code> sources built from the files described above, counts the reads at each position and base of interest, then averages the k-mer, quality and mapping score metrics described earlier:</p>
<p><pre class="brush: clojure;">
(defn calc-snpdata-stats [snpdata targets]
  (??&lt;- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct
         ?avg-qual ?avg-map ?type]
        (snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score)
        (targets ?chr ?pos ?base ?type)
        (ops/count ?count)
        (ops/avg ?kmer-pct :&gt; ?avg-kmer-pct)
        (ops/avg ?qual :&gt; ?avg-qual)
        (ops/avg ?map-score :&gt; ?avg-map)
        (combine-score ?kmer-pct ?qual ?map-score :&gt; ?score)
        (ops/avg ?score :&gt; ?avg-score)))
</pre></p>
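<p>If the declarative style is unfamiliar, it may help to see the same logic written out by hand. The Python sketch below is only an in-memory illustration of what the query asks for, not how Cascalog or Hadoop execute it: the two sources are joined on the shared <code>?chr</code>, <code>?pos</code> and <code>?base</code> variables, and the aggregators are computed per group. The combined score is left out, and the input rows are invented for the example.</p>

```python
from collections import defaultdict


def calc_stats(reads, targets):
    """reads: (chr, pos, base, qual, kmer_pct, map_score) tuples.
    targets: (chr, pos, base, call_type) tuples."""
    target_types = {(c, p, b): t for c, p, b, t in targets}
    groups = defaultdict(list)
    for chrom, pos, base, qual, kmer_pct, map_score in reads:
        key = (chrom, pos, base)
        # Inner join: keep only reads that match a target position and base.
        if key in target_types:
            groups[key].append((qual, kmer_pct, map_score))
    rows = []
    for key in sorted(groups):
        values = groups[key]
        count = len(values)
        avg_qual = sum(v[0] for v in values) / count
        avg_kmer = sum(v[1] for v in values) / count
        avg_map = sum(v[2] for v in values) / count
        rows.append(key + (count, avg_kmer, avg_qual, avg_map, target_types[key]))
    return rows


# Invented rows, for illustration only.
reads = [
    ("chr", 951, "T", 30, 0.01, 100),
    ("chr", 951, "T", 20, 0.03, 200),
    ("chr", 951, "G", 31, 0.05, 198),  # dropped: no matching target base
]
targets = [("chr", 951, "T", "false_positive")]
rows = calc_stats(reads, targets)
print(rows)
```

<p>The Cascalog version says the same thing in a handful of predicates and leaves the grouping, joining and distribution over the cluster to the framework.</p>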
<p>A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the <a href="http://wiki.answers.com/Q/What_is_min-max_normalization">min-max normalized</a> scores:</p>
<p><pre class="brush: clojure;">
(defn min-max-norm [score minv maxv]
  (let [trunc-score-max (if (&lt; score maxv) score maxv)
        trunc-score (if (&gt; trunc-score-max minv) trunc-score-max minv)]
    (/ (- trunc-score minv) (- maxv minv))))

(defmapop combine-score [kmer-pct qual map-score]
  (+ (min-max-norm kmer-pct 1e-5 0.10)
     (min-max-norm qual 4.0 35.0)
     (min-max-norm map-score 0.0 250.0)))
</pre></p>
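<p>As a worked example of the scoring, here is the same calculation sketched in Python and applied to the sample read shown earlier (quality 31, k-mer score 0.0515130211584, mapping score 198). Each metric is clamped to its range and rescaled to between 0 and 1, so the combined score falls between 0 and 3. The Python function names simply mirror the Clojure ones.</p>

```python
def min_max_norm(score, minv, maxv):
    """Clamp score to the range minv..maxv, then rescale it to 0..1."""
    clamped = max(minv, min(score, maxv))
    return (clamped - minv) / (maxv - minv)


def combine_score(kmer_pct, qual, map_score):
    """Sum of the three normalized metrics, using the ranges from the post."""
    return (min_max_norm(kmer_pct, 1e-5, 0.10)
            + min_max_norm(qual, 4.0, 35.0)
            + min_max_norm(map_score, 0.0, 250.0))


# k-mer: about 0.515, quality: 27/31 or about 0.871, mapping: 198/250 = 0.792
score = combine_score(0.0515130211584, 31, 198)
print(round(score, 3))
```

<p>Values outside a range are truncated rather than extrapolated, so a quality of 40 contributes the same 1.0 as a quality of 35.</p>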
<p>The final part of the code involves parsing the files and producing the <code>snpdata</code> and <code>targets</code> inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:</p>
<p><pre class="brush: clojure;">
(defmapop parse-snpdata-line [line]
  (let [[space pos base qual kmer-pct map-score] (split line #&quot;\t&quot;)]
    [space (Integer/parseInt pos) base (Integer/parseInt qual)
     (Float/parseFloat kmer-pct) (Integer/parseInt map-score)]))

(defn snpdata-from-hfs [dir]
  (let [source (hfs-textline dir)]
    (&lt;- [?chr ?pos ?base ?qual ?kmer-pct ?map-score]
        (source ?line)
        (parse-snpdata-line ?line :&gt; ?chr ?pos ?base ?qual
                                     ?kmer-pct ?map-score))))
</pre></p>
</div>
<div id="running-on-hadoop">
<h3>Running on Hadoop</h3>
<p>The <a href="https://github.com/hbc/projects/tree/master/snp-assess">full project is available on GitHub</a>. To run on a <a href="https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation">configured Hadoop system</a>, you build the code, copy your input files to HDFS, then run:</p>
<pre><code>% lein deps
% lein uberjar
% hadoop fs -mkdir /tmp/snp-assess/data
% hadoop fs -mkdir /tmp/snp-assess/positions
% hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data
% hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions
% hadoop jar snp-assess-0.0.1-SNAPSHOT-standalone.jar
             snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions
</code></pre>
<p>
<p>The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:</p>
<pre><code>% lein deps
% lein run :snp-data /directory/of/variation/data /directory/of/positions
</code></pre>
<p>Both approaches generate tabular output with our positions, counts, scores and average metrics:</p>
<pre><code>| 951 | T |  3 | 0.9 | 2.0e-04 | 24.7 | 55.7  | false_positive |
| 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive  |
</code></pre>
</div>
<p>
<div id="overview-and-additional-projects">
<h3>Overview and additional projects</h3>
<p>Cascalog provided an easy-to-use abstraction on top of Hadoop, which enabled exploration of densely mapped next-generation sequencing reads for variant detection. The code is free of scaling-specific details, and instead focuses purely on the data of interest.</p>
<p>Another example of Cascalog in a biological setting is the answer I wrote to <a href="http://biostar.stackexchange.com/questions/8821/hadoop-genomic-segments-and-join">Pierre&#8217;s question on BioStar</a>, dealing with overlapping genomic segments within Hadoop. The <a href="https://github.com/chapmanb/bcbb/tree/master/biostar/bed-hadoop">code is available from GitHub</a> as an additional starting point for getting oriented with Hadoop and Cascalog.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/07/04/summarizing-next-gen-sequencing-variation-statistics-with-hadoop-using-cascalog/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Bioinformatics jobs at Harvard School of Public Health</title>
		<link>http://bcbio.wordpress.com/2011/04/10/bioinformatics-jobs-at-harvard-school-of-public-health/</link>
		<comments>http://bcbio.wordpress.com/2011/04/10/bioinformatics-jobs-at-harvard-school-of-public-health/#comments</comments>
		<pubDate>Sun, 10 Apr 2011 19:25:11 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[ngs]]></category>
		<category><![CDATA[OpenBio]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=193</guid>
		<description><![CDATA[I&#8217;ve recently moved positions to the bioinformatics core at Harvard School of Public Health. It&#8217;s a great place to do science, with plenty of researchers doing interesting work and actively looking for bioinformatics collaborators. The team, working alongside members of the Hide Lab, is passionate about open source work. Both qualities made it a great [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=193&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve recently moved positions to the <a href="http://www.hsph.harvard.edu/research/bioinfocore/">bioinformatics core at Harvard School of Public Health</a>. It&#8217;s a great place to do science, with plenty of researchers doing interesting work and actively looking for bioinformatics collaborators. The team, working alongside members of the <a href="http://web.me.com/winhide/Win_Hide_Lab/Home.html">Hide Lab</a>, is passionate about open source work. Both qualities made it a great fit for my interests and experience.</p>
<p>My new group is currently hiring bioinformatics researchers. The work involves interacting collaboratively with a research group to understand their biological problem, creatively attacking the mountains of data underlying the research question, and presenting the results back in an intuitive fashion. On the programming side, it&#8217;s an opportunity to combine existing published toolkits with your own custom algorithms and approaches. On the biology side, you should be passionate and interested in thinking of novel ways to advance our understanding of the problems. Practically, all of this work will involve a wide range of technologies and approaches; I expect plenty of next-generation sequencing data and lots of learning about the best ways to scale analyses.</p>
<p>Our other goal is to build re-usable tools for the larger research community. We work extensively with analysis frameworks like <a href="http://usegalaxy.org">Galaxy</a> and open standards like <a href="http://isatab.sourceforge.net/">ISA-Tab</a>. We hope to extract the common parts from disparate experiments to build abstractions that help get new analyses done quicker. Tool building also involves automating and deploying analysis pipelines in a way that allows biologists to run them directly. By democratizing analyses and presenting results to researchers at a high level they can directly interact with, science is accelerated and the world becomes an awesomer place.</p>
<p>So if you enjoy the work I write about here, and have always secretly wanted to sit in an office right next to me, now is your big chance (no stalkers, please). If this sounds of interest, please get in touch and I&#8217;d be happy to pass along more details.</p>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/04/10/bioinformatics-jobs-at-harvard-school-of-public-health/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Parallel upload to Amazon S3 with python, boto and multiprocessing</title>
		<link>http://bcbio.wordpress.com/2011/04/10/parallel-upload-to-amazon-s3-with-python-boto-and-multiprocessing/</link>
		<comments>http://bcbio.wordpress.com/2011/04/10/parallel-upload-to-amazon-s3-with-python-boto-and-multiprocessing/#comments</comments>
		<pubDate>Sun, 10 Apr 2011 18:27:10 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=187</guid>
		<description><![CDATA[One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn&#8217;t going away soon. The use of Amazon in bioinformatics was brought up [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=187&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>One challenge with moving analysis pipelines to cloud resources like <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the <a href="http://www.illumina.com/systems/hiseq_2000.ilmn">HiSeq</a> and decreasing <a href="http://www.genome.gov/sequencingcosts/">sequencing costs</a>, the data transfer question isn&#8217;t going away soon. The use of Amazon in bioinformatics was brought up during a recent <a href="http://biostar.stackexchange.com/questions/7143/is-amazons-ec2-commonly-used-for-bioinformatics">discussion on the BioStar question answer site</a>. <a href="http://mndoci.github.com/">Deepak&#8217;s</a> answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores.</p>
<p><a href="http://aws.amazon.com/s3/faqs/#What_is_Amazon_S3">Amazon Simple Storage Service (S3)</a> provides relatively inexpensive cloud storage with its <a href="http://aws.amazon.com/s3/faqs/#What_is_RRS">reduced redundancy storage</a> option. S3, and all of Amazon&#8217;s cloud services, are accessible directly from Python using <a href="http://code.google.com/p/boto/">boto</a>. By using <a href="http://www.elastician.com/2010/12/s3-multipart-upload-in-boto.html">boto&#8217;s multipart upload support</a>, coupled with Python&#8217;s built-in <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing</a> module, I&#8217;ll demonstrate maximizing transfer speeds to make uploading data less painful. The <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/utils/s3_multipart_upload.py">script is available from GitHub</a> and requires <a href="https://github.com/boto/boto">the latest boto from GitHub (2.0b5 or better)</a>.</p>
<div id="parallel-upload-with-multiprocessing">
<h2>Parallel upload with multiprocessing</h2>
<p>The overall process uses boto to connect to an S3 upload bucket, initialize a multipart transfer, split the file into multiple pieces, and then upload these pieces in parallel over multiple cores. Each processing core is passed a set of credentials to identify the transfer: the multipart upload identifier (<code>mp.id</code>), the S3 file key name (<code>mp.key_name</code>) and the S3 bucket name (<code>mp.bucket_name</code>).</p>
<p><pre class="brush: python;">
import boto

conn = boto.connect_s3()
bucket = conn.lookup(bucket_name)
mp = bucket.initiate_multipart_upload(s3_key_name, reduced_redundancy=use_rr)
with multimap(cores) as pmap:
    for _ in pmap(transfer_part, ((mp.id, mp.key_name, mp.bucket_name, i, part)
                                  for (i, part) in
                                  enumerate(split_file(tarball, mb_size, cores)))):
        pass
mp.complete_upload()
</pre></p>
<p>The <code>split_file</code> function uses the Unix <code>split</code> command to divide the file into sections, each of which will be uploaded separately.</p>
<p><pre class="brush: python;">
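import glob
import os
import subprocess

def split_file(in_file, mb_size, split_num=5):
    # s3_key_name is not a parameter; it comes from the enclosing script.
    prefix = os.path.join(os.path.dirname(in_file),
                          &quot;%sS3PART&quot; % (os.path.basename(s3_key_name)))
    # Aim for two pieces per core, capped at 250Mb per piece.
    split_size = int(min(mb_size / (split_num * 2.0), 250))
    # Skip splitting when pieces from an earlier run are already present.
    if not os.path.exists(&quot;%saa&quot; % prefix):
        cl = [&quot;split&quot;, &quot;-b%sm&quot; % split_size, in_file, prefix]
        subprocess.check_call(cl)
    return sorted(glob.glob(&quot;%s*&quot; % prefix))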
</pre></p>
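<p>To get a feel for the piece sizes this produces, here is the size rule pulled out on its own. The helper name <code>part_size_mb</code> is made up for this illustration; the arithmetic is the same as in <code>split_file</code>.</p>

```python
def part_size_mb(mb_size, split_num=5):
    """Size in megabytes of each piece, using the same rule as split_file."""
    return int(min(mb_size / (split_num * 2.0), 250))


for total_mb in (1000, 4000, 20000):
    print(total_mb, "MB file ->", part_size_mb(total_mb, split_num=4), "MB pieces")
```

<p>With 4 cores, a 1000Mb file is cut into 125Mb pieces, two per core, while larger files hit the 250Mb cap and simply produce more pieces. Keep in mind that S3 requires every part except the last to be at least 5Mb, and that this rule rounds down to 0 for files only a few megabytes in size, so small files are better sent as a regular single upload.</p>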
<p>The multiprocessing aspect is managed using a <a href="http://docs.python.org/library/contextlib.html">contextmanager</a>. The initial multiprocessing pool is set up, using a specified number of cores, and configured to allow keyboard interrupts. We then return a lazy map function (<a href="http://docs.python.org/library/itertools.html#itertools.imap">imap</a>) which can be used just like Python&#8217;s standard <code>map</code>. This transparently divides the function calls for each file part over all available cores. Finally, the pool is cleaned up when the map is finished running.</p>
<p><pre class="brush: python;">
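import contextlib
import multiprocessing
from multiprocessing.pool import IMapIterator

@contextlib.contextmanager
def multimap(cores=None):
    # Default to one fewer than the number of available cores (minimum 1).
    if cores is None:
        cores = max(multiprocessing.cpu_count() - 1, 1)
    # Always pass a timeout when fetching results; without one a
    # keyboard interrupt is not delivered while waiting on the pool.
    def wrapper(func):
        def wrap(self, timeout=None):
            return func(self, timeout=timeout if timeout is not None else 1e100)
        return wrap
    IMapIterator.next = wrapper(IMapIterator.next)
    pool = multiprocessing.Pool(cores)
    yield pool.imap
    pool.terminate()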
</pre></p>
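<p>The pattern itself is easy to try out in isolation. The sketch below is a variation rather than the code from the script: it uses a thread pool so it runs anywhere without pickling concerns, it wraps the cleanup in <code>try</code>/<code>finally</code> so the pool is also shut down when the body raises an error, and the names <code>pooled_imap</code> and <code>part_label</code> are invented for the example.</p>

```python
import contextlib
from multiprocessing.pool import ThreadPool


@contextlib.contextmanager
def pooled_imap(workers):
    """Yield a lazy, ordered map function backed by a pool of workers."""
    pool = ThreadPool(workers)
    try:
        yield pool.imap
    finally:
        # Runs on normal exit and on errors, so workers never linger.
        pool.terminate()


def part_label(args):
    """Stand-in for the real per-part work: label each piece from 1."""
    index, name = args
    return "%d:%s" % (index + 1, name)


with pooled_imap(2) as pmap:
    # Consume the results inside the block, before the pool is terminated.
    labels = list(pmap(part_label, enumerate(["partaa", "partab", "partac"])))
print(labels)
```

<p>Because <code>imap</code> is lazy, the results need to be consumed inside the <code>with</code> block, which is why the upload code above loops over the map output even though it discards each value.</p>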
<p>The actual work of transferring each portion of the file is done using two functions. The helper function, <code>mp_from_ids</code>, uses the id information about the bucket, file key and multipart upload id to reconstitute a multipart upload object:</p>
<p><pre class="brush: python;">
def mp_from_ids(mp_id, mp_keyname, mp_bucketname):
    conn = boto.connect_s3()
    bucket = conn.lookup(mp_bucketname)
    mp = boto.s3.multipart.MultiPartUpload(bucket)
    mp.key_name = mp_keyname
    mp.id = mp_id
    return mp
</pre></p>
<p>This object, together with the number of the file part and the file itself, is used to transfer that section of the file. The file part is removed after successful upload.</p>
<p><pre class="brush: python;">
@map_wrap
def transfer_part(mp_id, mp_keyname, mp_bucketname, i, part):
    mp = mp_from_ids(mp_id, mp_keyname, mp_bucketname)
    print &quot; Transferring&quot;, i, part
    with open(part) as t_handle:
        mp.upload_part_from_file(t_handle, i+1)
    os.remove(part)
</pre></p>
<p>When all sections, distributed over all processors, are finished, the multipart upload is signaled complete and Amazon finishes the process. Your file is now available on S3.</p>
</div>
<div id="parallel-download">
<h2>Parallel download</h2>
<p>Download speeds can be maximized by utilizing several existing parallelized accelerators:</p>
<ul>
<li><a href="http://axel.alioth.debian.org/">axel</a></li>
<li><a href="http://aria2.sourceforge.net/">aria2</a></li>
<li><a href="http://lftp.yar.ru/">lftp</a></li>
</ul>
<p>Combine these with the uploader to build up a cloud analysis workflow: move your data to S3, run a complex analysis pipeline on EC2, push the results back to S3, and then download them to local machines. Please share other tips and tricks you use to deal with Amazon file transfer in the comments.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/04/10/parallel-upload-to-amazon-s3-with-python-boto-and-multiprocessing/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Next generation sequencing information management and analysis system for Galaxy</title>
		<link>http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/</link>
		<comments>http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 15:05:56 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[galaxy]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=169</guid>
		<description><![CDATA[Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=169&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Next generation sequencing technologies like <a href="http://www.illumina.com/technology/sequencing_technology.ilmn">Illumina</a>, <a href="http://solid.appliedbiosystems.com/">SOLiD</a> and <a href="http://www.454.com/">454</a> have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses.</p>
<p>Our group at <a href="http://genetics.mgh.harvard.edu/bioinformatics/">Massachusetts General Hospital</a> approached these challenges by developing a sample submission and tracking interface on top of the web-based <a href="http://bitbucket.org/galaxy/galaxy-central/wiki/Home">Galaxy</a> data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.</p>
<p>This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.</p>
<div id="front-end-usage">
<h2>Front-end usage</h2>
<div id="researcher-sample-entry">
<h3>Researcher sample entry</h3>
<p>Biologists use a local Galaxy server as an entry point to submit samples for sequencing. This provides a familiar interface and central location for both entering sample information and retrieving and analyzing the sequencing data.</p>
<p>Practically, a user begins by browsing to the sample submission page. There they are presented with a wizard interface which guides them through entry of sample details. Multiplexed samples are supported through a drag and drop interface.</p>
<span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/"><img src="http://img.youtube.com/vi/HGhNMeEAFV0/2.jpg" alt="" /></a></span>
<p>When all samples are entered, the user submits them as a sequencing project. This includes billing information and a project name to facilitate communication between the researcher and the core group about submissions. Users are able to view their submissions grouped as projects and track the state of constructs. Since we support a number of services in addition to sequencing &#8212; like library construction, quantitation and validation &#8212; this is a valuable way for users to track and organize their requests.</p>
<span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/"><img src="http://img.youtube.com/vi/DtQG9IzpoCU/2.jpg" alt="" /></a></span>
</div>
<div id="sequencing-tracking-and-management">
<h3>Sequencing tracking and management</h3>
<p>Administrators and sequencing technicians have access to additional functionality to help manage the internal sample preparation and sequencing workflow. The main sample tracking interface centers around a set of queues; each queue represents a state that a sample can be in. Samples move through the queues as they are processed, with additional information being added to the sample at each step. For instance, a sample in the &#8216;Pre-sequencing quantitation&#8217; queue moves to the &#8216;Sequencing&#8217; queue once it has been fully quantitated, with that quantitation information entered by the sequencing technician during the transition.</p>
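<p>The queue idea can be sketched in a few lines of Python. This is an illustration only, not the Galaxy implementation: the class and function names are hypothetical, and the only queue names taken from the system are the two quoted above.</p>
<p><pre class="brush: python;">
# Hypothetical queue order; the first and last names are placeholders
QUEUES = ["Submitted", "Pre-sequencing quantitation",
          "Sequencing", "Complete"]

class Sample(object):
    def __init__(self, name):
        self.name = name
        self.queue = QUEUES[0]
        self.details = {}

def advance(sample, **new_details):
    # Move a sample to the next queue, storing the information
    # entered by the technician during the transition.
    position = QUEUES.index(sample.queue)
    if position == len(QUEUES) - 1:
        raise ValueError("%s is already in the final queue" % sample.name)
    sample.details.update(new_details)
    sample.queue = QUEUES[position + 1]
    return sample
</pre></p>
<p>Calling <code>advance(sample, concentration_nm=10.0)</code> on a sample in the quantitation queue records the value and moves it on to sequencing; the <code>concentration_nm</code> field is made up for the example.</p>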
<p>Assigning samples to flow cells occurs using a drag and drop <a href="http://jqueryui.com/">jQueryUI</a> interface. The design is flexible to allow for placing samples across multiple lanes or multiplexing multiple barcoded samples into a single lane.</p>
<span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/"><img src="http://img.youtube.com/vi/Sjt6y1lbzVI/2.jpg" alt="" /></a></span>
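<p>One way to picture the two layouts described above, a sample spread over several lanes and several barcoded samples sharing one lane, is a mapping from lane number to a list of sample and barcode pairs. The sketch below is an assumption for illustration: the <code>FlowCell</code> class, the default of 8 lanes and the barcode strings are not taken from the system.</p>
<p><pre class="brush: python;">
class FlowCell(object):
    def __init__(self, n_lanes=8):
        # lane number mapped to a list of (sample, barcode) pairs;
        # the default of 8 lanes is an assumption for this sketch
        self.lanes = dict((i, []) for i in range(1, n_lanes + 1))

    def assign(self, sample, lane, barcode=None):
        existing = self.lanes[lane]
        barcodes = [b for (_, b) in existing]
        # A shared lane only works if every sample in it is barcoded
        if existing and (barcode is None or None in barcodes):
            raise ValueError("lane %s: sharing a lane requires barcodes" % lane)
        if barcode is not None and barcode in barcodes:
            raise ValueError("lane %s: barcode %s already used" % (lane, barcode))
        existing.append((sample, barcode))
</pre></p>
<p>A sample placed in lanes 1 and 2 is simply assigned twice, while two samples in lane 3 need distinct barcodes such as <code>ATCACG</code> and <code>CGATGT</code>.</p>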
</div>
<div id="viewing-sequencing-results">
<h3>Viewing sequencing results</h3>
<p>Running a sequencing machine requires careful monitoring of results and our interface provides several ways to view this data. Raw cluster and read counts are linked to a list of runs. For higher level analyses, interactive plots are available for viewing reads over time and pass rates compared to read density. These allow adjustment of experimental procedures to maximize useful reads based on current machine chemistry.</p>
<span style="text-align:center; display: block;"><a href="http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/"><img src="http://img.youtube.com/vi/4xrtPXE7Oe8/2.jpg" alt="" /></a></span>
</div>
</div>
<div id="analysis-pipeline">
<h2>Analysis pipeline</h2>
<p>Utilizing a front end that organizes requests allows sequencing results to be processed through a fully automated analysis pipeline on the back end. The pipeline detects runs coming off of a sequencer, transfers files to storage and analysis machines and manages a number of processing steps:</p>
<ul>
<li>Alignment with <a href="http://bowtie-bio.sourceforge.net/">bowtie</a> or <a href="http://bio-bwa.sourceforge.net/">bwa</a>.</li>
<li>Generation of alignment and read statistics with <a href="http://picard.sourceforge.net/">Picard</a>, the <a href="http://hannonlab.cshl.edu/fastx_toolkit/">fastx toolkit</a> and <a href="http://solexaqa.sourceforge.net/">SolexaQA</a>.</li>
<li>Preparation of a <a href="http://bcbio.files.wordpress.com/2011/01/example-summary.pdf">summary PDF</a> with detailed statistics about the run and alignment.</li>
</ul>
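<p>To give a feel for the alignment step, here is a hedged sketch that builds the two bwa command lines for single-end reads using <code>bwa aln</code> followed by <code>bwa samse</code>. The function name and file naming are assumptions; the real pipeline in the repository handles many more options and may invoke the aligner differently.</p>
<p><pre class="brush: python;">
def bwa_commands(ref_index, fastq, out_prefix):
    # Hypothetical helper: returns the commands as argument lists,
    # ready to be passed one at a time to subprocess.check_call.
    sai = out_prefix + ".sai"
    sam = out_prefix + ".sam"
    return [
        # Find suffix array coordinates for the reads
        ["bwa", "aln", "-f", sai, ref_index, fastq],
        # Convert the coordinates into SAM alignments
        ["bwa", "samse", "-f", sam, ref_index, sai, fastq],
    ]
</pre></p>
<p>Returning argument lists rather than a shell string keeps file names with spaces safe and makes the commands easy to log before they run.</p>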
<p>In addition to the default analysis, a full SNP calling pipeline is included with:</p>
<ul>
<li><a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">GATK</a> <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration">recalibration</a> and <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels">realignment</a></li>
<li>SNP identification with GATK&#8217;s <a href="http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper">unified genotyper</a></li>
<li>Variant effect prediction with <a href="http://sourceforge.net/projects/snpeff/">snpEff</a>.</li>
</ul>
<p>Fastq reads, alignment files, summary PDFs and other associated files are uploaded back into Galaxy Data Libraries organized by sample names. Users can download results for offline work, or import them directly into their Galaxy history for further analysis or display.</p>
<p><a href="http://bcbio.files.wordpress.com/2011/01/data_library_results.png"><img src="http://bcbio.files.wordpress.com/2011/01/data_library_results.png?h=300&w=700" alt="Uploaded analysis files in data library" /></a></p>
</div>
<div id="installing-and-extending">
<h2>Installing and extending</h2>
<p>The code base is maintained as a <a href="http://bitbucket.org/chapmanb/galaxy-central">Bitbucket repository</a> that tracks the main <a href="http://bitbucket.org/galaxy/galaxy-central/src">galaxy-central</a> distribution. It is updated from the main site regularly to maintain compatibility, with the future goal of integrating a generalized version into the main source tree. <a href="https://bitbucket.org/galaxy/galaxy-central/wiki/LIMS/nglims">Detailed installation instructions</a> are available for setting up the front-end client.</p>
<p>The analysis pipeline is written in Python and drives a number of open source programs; it is available as a <a href="https://github.com/chapmanb/bcbb/tree/master/nextgen">GitHub repository</a> with documentation and installation instructions.</p>
<p>We are using the current system in production and continue to develop and add features based on user feedback. We would like to generalize this for other research cores with additional instruments and services, and would be happy to hear from developers working on this type of system for their facilities.</p>
</div>
<div id="implementation-details">
<h2>Implementation details</h2>
<p>This work would not have been possible without the great open source toolkits and frameworks that it builds on. Galaxy provides not only an analysis framework, but also a ready to use database structure for managing samples and requests. The front end builds off existing <a href="http://main.g2.bx.psu.edu/u/rkchak/p/sts">Galaxy sample tracking</a> work, and requires only two new database storage tables.</p>
<p>The main change from the existing sample tracking framework is a generalization of the sample and request relationships. Requests can both contain samples and be part of samples, so that a sequenced sample is organized as:</p>
<p><a href="http://bcbio.files.wordpress.com/2010/03/request_samples.png"><img src="http://bcbio.files.wordpress.com/2010/03/request_samples.png?h=300&w=700" alt="Request sample database architecture" /></a></p>
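<p>In code, one reading of this two-way relationship looks like the sketch below. It is an interpretation for illustration, not the Galaxy database model: the class names, attributes and the example project are all hypothetical.</p>
<p><pre class="brush: python;">
class Request(object):
    def __init__(self, name, parent_sample=None):
        self.name = name
        self.samples = []                   # samples this request contains
        self.parent_sample = parent_sample  # sample this request is part of
        if parent_sample is not None:
            parent_sample.requests.append(self)

class Sample(object):
    def __init__(self, name, request):
        self.name = name
        self.request = request  # request that contains this sample
        self.requests = []      # follow-on requests made for this sample
        request.samples.append(self)

def lineage(item):
    # Walk up from a sample or request to the top-level request
    names = []
    while item is not None:
        names.append(item.name)
        if isinstance(item, Sample):
            item = item.request
        else:
            item = item.parent_sample
    return names[::-1]
</pre></p>
<p>With a project request containing a library sample, a sequencing request made for that library, and a sequenced sample inside it, <code>lineage</code> returns the chain from the project down to the sequenced sample.</p>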
<p>By reusing and expanding the great work of the Galaxy team, we hope to eventually integrate useful parts of this work into the Galaxy codebase.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2011/01/data_library_results.png?h=300" medium="image">
			<media:title type="html">Uploaded analysis files in data library</media:title>
		</media:content>

		<media:content url="http://bcbio.files.wordpress.com/2010/03/request_samples.png?h=300" medium="image">
			<media:title type="html">Request sample database architecture</media:title>
		</media:content>
	</item>
		<item>
		<title>CloudBioLinux: progress on bioinformatics cloud images and data</title>
		<link>http://bcbio.wordpress.com/2010/10/13/cloudbiolinux-progress-on-bioinformatics-cloud-images-and-data/</link>
		<comments>http://bcbio.wordpress.com/2010/10/13/cloudbiolinux-progress-on-bioinformatics-cloud-images-and-data/#comments</comments>
		<pubDate>Wed, 13 Oct 2010 23:19:39 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[ngs]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=164</guid>
		<description><![CDATA[My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we&#8217;ve had amazing interest from the community and made great progress with: A permanent web site at cloudbiolinux.org Additional software and genomic data [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=164&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>My last post introduced a <a href="http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/">framework for building bioinformatics cloud images</a>, which makes it easy to do biological computing work using <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> and other on-demand computing providers. Since that initial announcement we&#8217;ve had amazing interest from the community and made great progress with:</p>
<ul>
<li>A permanent web site at <a href="http://cloudbiolinux.org/">cloudbiolinux.org</a></li>
<li>Additional software and genomic data</li>
<li>New user documentation</li>
<li>A community coding session: <a href="http://www.open-bio.org/wiki/Codefest_2010">Codefest 2010</a></li>
</ul>
<div id="new-software-and-data">
<h2>New software and data</h2>
<p>The most exciting changes have been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom configured servers. A <a href="http://github.com/chapmanb/cloudbiolinux/tree/master/contributors.mkd">great group of contributors</a> have put together a large set of <a href="http://github.com/chapmanb/cloudbiolinux/tree/master/config/#readme">programs and libraries</a>; the configuration files have all the details on <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/packages.yaml">installed programs</a> as well as libraries for <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/python-libs.yaml">Python</a>, <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/perl-libs.yaml">Perl</a>, <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/ruby-libs.yaml">Ruby</a>, and <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/r-libs.yaml">R</a>. Another addition is support for <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/custom.yaml">non-packaged programs</a>, which provides software not yet neatly wrapped in a package manager or library-specific install system: next-gen software packages like <a href="http://picard.sourceforge.net/">Picard</a>, <a href="http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit">GATK</a> and <a href="http://bowtie-bio.sourceforge.net/">Bowtie</a> are installed through custom scripts.</p>
<p>To improve accessibility for developers who prefer a desktop experience, a <a href="http://freenx.berlios.de/">FreeNX server</a> was integrated with the provided images. Tim Booth from the <a href="http://nebc.nerc.ac.uk/tools/bio-linux">NEBC Bio-Linux</a> team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.</p>
<p>In addition to the software image, a publicly available data volume is now available that contains:</p>
<ul>
<li>Genome sequences pre-indexed for search with next-gen aligners like <a href="http://bowtie-bio.sourceforge.net/">Bowtie</a>, <a href="http://www.novocraft.com/main/page.php?s=novoalign">Novoalign</a>, and <a href="http://bio-bwa.sourceforge.net/">BWA</a>.</li>
<li><a href="http://genome.ucsc.edu/cgi-bin/hgLiftOver">LiftOver</a> files for mapping between sequence coordinates.</li>
<li><a href="http://www.ebi.ac.uk/uniref/">UniRef</a> protein databases, indexed for searching with <a href="http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&amp;PAGE_TYPE=BlastDocs&amp;DOC_TYPE=Download">BLAST+</a>.</li>
</ul>
<p>Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of <a href="http://galaxyproject.org">Galaxy</a>, the data volume plugs directly into <a href="http://bitbucket.org/galaxy/galaxy-central/wiki/cloud">Galaxy&#8217;s ready to use cloud image</a>. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.</p>
<p>The data volume preparation is fully automated via a <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/data_fabfile.py">fabric install script</a>, similar to the <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/fabfile.py">software install script</a>. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.</p>
</div>
<div id="documentation-and-presentations">
<h2>Documentation and presentations</h2>
<p>The software and data volumes are only as good as the documentation which helps people use them:</p>
<ul>
<li>Bela Tiwari of the NEBC Bio-Linux team has written an <a href="http://bazaar.launchpad.net/%7Ebt27uk/cloud-bl-tutorials/trunk/download/head%3A/gettingstarted_cloud-20100913135735-ilr1gse4kxo9ylx6-1/gettingStarted_CloudBioLinux.pdf">excellent introduction to Amazon EC2 and CloudBioLinux</a>. This breaks down the process of signing up for an account, creating a software image, associating data volumes and setting up a graphical server. It&#8217;s a great place to get started with CloudBioLinux.</li>
<li>Ntino Krampis, from the <a href="http://www.jcvi.org/cms/research/projects/jcvi-cloud-biolinux/overview/">JCVI Cloud Bio-Linux</a> project, gave a <a href="http://www.slideshare.net/agbiotec/chi-next-genntinokrampis">presentation on CloudBioLinux</a> explaining the motivation behind the project and providing usage examples.</li>
<li>My presentation on <a href="http://www.slideshare.net/chapmanb/chapman-opensource-cloud">the open source community behind CloudBioLinux</a> from <a href="http://aws.amazon.com/genomics_workshop/">Amazon&#8217;s Genomic Data workshop</a>. This details the project goals and automated code organization.</li>
</ul>
</div>
<div id="community:-codefest-2010">
<h2>Community: Codefest 2010</h2>
<p>The CloudBioLinux community had a chance to work together for two days in July at <a href="http://www.open-bio.org/wiki/Codefest_2010">Codefest 2010</a>. In conjunction with the <a href="http://www.open-bio.org/wiki/BOSC_2010">Bioinformatics Open Source Conference (BOSC)</a> in Boston, this was a free to attend coding session hosted at <a href="http://www.hsph.harvard.edu/research/bioinfocore/">Harvard School of Public Health</a> and <a href="http://www.mgh.harvard.edu/">Massachusetts General Hospital</a>. Over 30 developers donated two days of their time to working on CloudBioLinux and other bioinformatics open source projects.</p>
<p>Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.</p>
<p>It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We&#8217;re planning another session next year in Vienna, again just before <a href="http://www.iscb.org/ismbeccb2011">BOSC and ISMB 2011</a>; and again, everyone is welcome.</p>
</div>
<div id="summary">
<h2>Summary</h2>
<p>Go to the <a href="http://cloudbiolinux.org/">CloudBioLinux website</a> for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With <a href="http://aws.amazon.com/about-aws/whats-new/2010/09/09/announcing-micro-instances-for-amazon-ec2/">Amazon&#8217;s new micro instances</a> you can start analyzing data for only a few cents an hour. It&#8217;s an easy way to explore whether cloud resources will help with computational demands in your work. We&#8217;re very interested in feedback and happy to have other developers helping out; please get in touch on the <a href="http://groups.google.com/group/cloudbiolinux">CloudBioLinux mailing list</a>.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2010/10/13/cloudbiolinux-progress-on-bioinformatics-cloud-images-and-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Automated build environment for Bioinformatics cloud images</title>
		<link>http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/</link>
		<comments>http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/#comments</comments>
		<pubDate>Sat, 08 May 2010 14:35:21 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[cloud-computing]]></category>
		<category><![CDATA[ec2]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/?p=150</guid>
		<description><![CDATA[Amazon web services provide scalable, on demand computational resources through their elastic compute cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I&#8217;m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused Amazon Machine [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=bcbio.wordpress.com&amp;blog=5850073&amp;post=150&amp;subd=bcbio&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://aws.amazon.com">Amazon web services</a> provide scalable, on demand computational resources through their <a href="http://aws.amazon.com/ec2/">elastic compute cloud (EC2)</a>. Previously, I <a href="http://bcbio.wordpress.com/2009/09/07/usage-plans-for-amazon-web-services-research-grant/">described the goal</a> of providing publicly available machine images loaded with bioinformatics tools. I&#8217;m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused <a href="http://en.wikipedia.org/wiki/Amazon_Machine_Image">Amazon Machine Image (AMI)</a> containing packages integrated from several existing efforts. The hope is to consolidate the community&#8217;s open source work around a single, continuously improving, machine image.</p>
<p>This image incorporates software from several existing AMIs:</p>
<ul>
<li><a href="http://www.jcvi.org/cms/research/projects/jcvi-cloud-biolinux/overview/">JCVI Cloud BioLinux</a> &#8212; JCVI&#8217;s work porting Bio-Linux to the cloud.</li>
<li><a href="http://fortinbras.us/bioperl-max/">bioperl-max</a> &#8212; Fortinbras&#8217; package of BioPerl and associated informatics tools.</li>
<li><a href="http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/">MachetEC2</a> &#8212; An InfoChimps image loaded with data mining software.</li>
</ul>
<p>Each of these projects inspired different aspects of developing this image and associated infrastructure, and I&#8217;m extremely grateful to the authors for their code, documentation and discussions.</p>
<p>The current AMI is available for loading on EC2 &#8212; search for &#8216;CloudBioLinux&#8217; in the <a href="https://console.aws.amazon.com/">AWS console</a> or go to <a href="http://cloudbiolinux.org">the CloudBioLinux project page</a> for the latest AMIs. Automated scripts and configuration files listing the included packages are available as a <a href="https://github.com/chapmanb/cloudbiolinux">GitHub repository</a>.</p>
<h2>Contributions encouraged</h2>
<p>This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up-to-date libraries and programs they have installed on their work machines. If their favorite package is missing it should be quick and easy to add, making the improvement available to future developers.</p>
<p>Achieving these goals requires help and contributions from other programmers utilizing the cloud &#8212; everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I&#8217;d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:</p>
<ul>
<li>Programmers: help expand the configuration files for your areas of interest:
<ul>
<li>Perl CPAN support and libraries</li>
<li>Ruby gems</li>
<li>Java libraries</li>
<li>Haskell hackage support and libraries</li>
<li>Erlang libraries</li>
<li>Bioinformatics areas of specialization:
<ul>
<li>Next-gen sequencing</li>
<li>Structural biology</li>
<li>Parallelized algorithms</li>
</ul>
</li>
<li>Much more&#8230; Let us know what you are interested in.</li>
</ul>
</li>
<li>Documentation experts: provide cookbook style instructions to help others get started.</li>
<li>Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.</li>
<li>Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We&#8217;d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.</li>
<li>Testers: Check that this runs on <a href="http://www.eucalyptus.com/">open source Eucalyptus</a> clouds, additional linux distributions, and other cloud deployments.</li>
</ul>
<p>If any of this sounds interesting, please get in contact. The <a href="http://groups.google.com/group/cloudbiolinux">Cloud BioLinux mailing list</a> is a good central point for discussion.</p>
<h2>Infrastructure overview</h2>
<p>In addition to supplying an image for downstream use, this implementation was designed to be easily extensible. Inspired by the <a href="http://github.com/infochimps/machetec2">MachetEC2</a> project, packages to be installed are entered into a set of easy-to-edit configuration files in <a href="http://www.yaml.org/">YAML</a> syntax. There are three different configuration file types:</p>
<ul>
<li><a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/main.yaml">main.yaml</a> &#8212; The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.</li>
<li><a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/packages.yaml">packages.yaml</a> &#8212; Defines Debian/Ubuntu packages to be installed. This leans heavily on the work of the <a href="http://www.debian.org/devel/debian-med/">DebianMed</a> and <a href="http://nebc.nox.ac.uk/tools/bio-linux/">Bio-Linux</a> communities, as well as all of the hard-working package maintainers for the distributions. If it exists in package form, you can list it here.</li>
<li><a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/python-libs.yaml">python-libs.yaml</a>, <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/config/r-libs.yaml">r-libs.yaml</a> &#8212; These take advantage of language-specific ways of installing libraries. Support is currently implemented for Python library installation from the <a href="http://pypi.python.org/">Python package index</a>, and R library installation from <a href="http://cran.r-project.org/">CRAN</a> and <a href="http://www.bioconductor.org/docs/install/">Bioconductor</a>. This will be expanded to include support for other languages.</li>
</ul>
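<p>As a rough illustration of how the first two files fit together, here is a minimal Python sketch. It is not the project's actual code: the dictionaries stand in for the parsed YAML, and the key names, group names, package lists and the <code>select_packages</code> helper are all assumptions made for the example.</p>

```python
# Hypothetical stand-ins for the parsed YAML files; the real key names
# and groupings in main.yaml and packages.yaml may differ.
MAIN_CONFIG = {
    "packages": [
        "python",
        "r",
        # "ruby",  # commented out: this group is skipped in a custom build
    ],
}

# Illustrative mapping of group name to Debian/Ubuntu package names.
PACKAGES_CONFIG = {
    "python": ["python-numpy", "python-scipy"],
    "r": ["r-base"],
    "ruby": ["ruby1.9.1"],
}


def select_packages(main_config, packages_config):
    """Return the sorted, de-duplicated packages for the enabled groups."""
    selected = set()
    for group in main_config["packages"]:
        if group not in packages_config:
            raise KeyError("Unknown package group: %s" % group)
        selected.update(packages_config[group])
    return sorted(selected)


if __name__ == "__main__":
    print(select_packages(MAIN_CONFIG, PACKAGES_CONFIG))
```

<p>With the ruby group commented out, only the Python and R packages are selected, which is the idea behind customizing an image by editing the high level file.</p>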
<p>The <a href="http://docs.fabfile.org/">Fabric remote automated deployment tool</a> is used to build AMIs from  these configuration files. Written in Python, the <a href="https://github.com/chapmanb/cloudbiolinux/blob/master/fabfile.py">fabfile</a> automates the process of installing packages on the cloud machine.</p>
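<p>To give a feel for what this automation involves, the sketch below turns a list of package names into a single non-interactive <code>apt-get</code> command and hands it to a runner. This is an assumption about the general approach, not a copy of the fabfile: the function names are hypothetical, and the runner is injected so the example works without a remote machine. In a fabfile, the runner could be Fabric's <code>sudo</code> function.</p>

```python
import shlex


def apt_install_command(packages):
    """Build a non-interactive apt-get command for the given packages."""
    if not packages:
        raise ValueError("no packages to install")
    # Quote each name so unexpected characters cannot alter the command.
    quoted = " ".join(shlex.quote(name) for name in packages)
    return "apt-get -y install " + quoted


def install_packages(packages, run):
    """Install packages by passing one command to the supplied runner.

    `run` is any callable taking a command string. It is injected here
    (a hypothetical design choice) so the sketch can be exercised locally.
    """
    command = apt_install_command(packages)
    run(command)
    return command


if __name__ == "__main__":
    # Print the command instead of executing it on a machine.
    install_packages(["python-numpy", "r-base"], print)
```

<p>Keeping the command construction separate from the code that runs it makes the logic easy to test before pointing it at a real cloud instance.</p>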
<p>We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding in new functionality.</p>
<p>If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out <a href="http://www.open-bio.org/wiki/Codefest_2010">Codefest 2010</a>; it&#8217;ll be two enjoyable days of cloud informatics development. I&#8217;m looking forward to hearing from other developers who are interested in building and maintaining an easy-to-use, up-to-date machine image that can help make biological computation more accessible to the community.</p>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
		<item>
		<title>Biopython projects for Google Summer of Code 2010</title>
		<link>http://bcbio.wordpress.com/2010/03/26/biopython-projects-for-google-summer-of-code-2010/</link>
		<comments>http://bcbio.wordpress.com/2010/03/26/biopython-projects-for-google-summer-of-code-2010/#comments</comments>
		<pubDate>Fri, 26 Mar 2010 13:02:12 +0000</pubDate>
		<dc:creator>Brad Chapman</dc:creator>
				<category><![CDATA[OpenBio]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[biopython]]></category>
		<category><![CDATA[gsoc]]></category>

		<guid isPermaLink="false">http://bcbio.wordpress.com/2010/03/26/biopython-projects-for-google-summer-of-code-2010/</guid>
		<description><![CDATA[Google Summer of Code provides the unique opportunity for students to spend a summer working on open source projects and getting paid. Biopython was involved with two great projects last summer, and it&#8217;s time to apply for this year&#8217;s program: the student application period is from next Monday, March 29th to Friday, April 9th, 2010. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://code.google.com/soc/">Google Summer of Code</a> provides the unique opportunity for students to spend a summer working on open source projects and getting paid. Biopython was <a href="http://bcbio.wordpress.com/2009/04/20/biopython-projects-for-google-summer-of-code/">involved with two great projects last summer</a>, and it&#8217;s time to apply for this year&#8217;s program: the student application period is from next Monday, March 29th to Friday, April 9th, 2010.
</p>
<p>If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:
</p>
<ul>
<li><a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010">NESCent Phyloinformatics</a> &#8212; NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.</li>
<li><a href="http://www.open-bio.org/wiki/Google_Summer_of_Code">Open Bioinformatics Foundation</a> &#8212; The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.</li>
</ul>
<p>This year, I&#8217;ve collaborated on three project ideas centered on tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts out the implementation details, allowing you to focus instead on the high level biological questions. <a href="http://measuringmeasures.com/">Bradford Cross</a>, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in <a href="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data">an interview at Data Wrangling</a>.
</p>
<p>
These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:
</p>
<ul>
<li><a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Biopython_and_PyCogent_interoperability">Biopython and PyCogent interoperability</a></li>
<li><a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development">Phylogenetics pipeline development in Galaxy</a></li>
<li><a href="https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2010#Accessing_R_phylogenetic_tools_from_Python">Building Python APIs for R phylogenetic toolkits</a></li>
</ul>
<p>All involve taking two or more different toolkits and combining the functionality into a higher level interface focused on ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on the functionality most relevant to the student&#8217;s biological questions. Ideally the work would both produce a publicly available resource and contribute directly to the student&#8217;s daily research.
</p>
<p>If you&#8217;re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It&#8217;s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.</p>
]]></content:encoded>
			<wfw:commentRss>http://bcbio.wordpress.com/2010/03/26/biopython-projects-for-google-summer-of-code-2010/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Brad Chapman</media:title>
		</media:content>
	</item>
	</channel>
</rss>
