Archive for the ‘OpenBio’ Category
Summary from Bioinformatics Open Science Codefest 2013: Tools, infrastructure, standards and visualization
The 2013 Bioinformatics Open Source Conference (BOSC) starts tomorrow in Berlin, Germany. It’s a yearly conference devoted to community-based software development projects supporting biological research. Members of the Open Bioinformatics Foundation discuss implementations and approaches to better provide interoperable and reusable software, libraries and pipelines.
For the past five years, a two-day Codefest and hackathon has preceded the conference. This gives programmers time to work face-to-face, sharing approaches and discovering connections between projects. This year, the Department of Biology, Humboldt-Universität zu Berlin kindly hosted Codefest 2013. Thanks to the organizers and attendees, we finished projects spanning tool development, infrastructure integration, standards development and visualization. There are photos of the Codefest in progress and a detailed writeup of projects.
Below we summarize the accomplishments from the two days. We welcome feedback on the topics covered and hope that by sharing our work we can encourage more programmers to become part of the open science bioinformatics community. Actively working to build well-tested, community-developed, interoperable tools is how we solve increasingly difficult research questions ranging from human health to plant breeding to microbial community function. The progress made in two days illuminates the effectiveness of open collaborative science.
BioRuby and BaseSpace – Develop SDK and apps for Illumina BaseSpace
Toshiaki Katayama, Raoul Bonnal, Eri Kibukawa, Joachim Baran, Dan MacLean, Fernando Izquierdo-Carrasco, Spencer Bliven
During the Codefest, we tested and documented our port of the BaseSpace Python SDK to Ruby. Ruby/Biogem developers can now easily use next-generation sequencing code within Illumina’s BaseSpace framework. For non-Ruby programmers, we found that creating a new Web app from scratch on top of an NGS program can be a burden, so we started a new project to provide a Web-app scaffold for BaseSpace. We have already implemented the basic portion but will need some more time before releasing the BioBaseSpace application. The BaseSpace Ruby SDK was officially released: for more information, see Joachim’s blog post, the official announcement from the BioRuby team and the announcement from Illumina.
Barrnap – Bacterial ribosomal RNA predictor
Torsten Seemann, Tim Booth
For the last 8 years RNAmmer has been the standard tool for predicting ribosomal RNA features in genomes, because it is reasonably fast, accurate, and works on bacteria and eukaryotes. Its drawbacks are that it relies on small, older databases; requires an older, conflicting version of HMMER; and has restrictive licence terms. To resolve these issues we have implemented a new rRNA predictor which uses the new “nhmmer” tool from HMMER 3.1 for searching DNA profiles against DNA sequence. We used the Silva and GreenGenes seed alignments for the 5S, 23S and 16S genes to build the profile models. Barrnap is a small Perl script which takes FASTA as input, and outputs the rRNA feature predictions in GFF3 format. It will be packaged in Bio-Linux and replace RNAmmer in the Prokka bacterial annotation system.
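To make the output format concrete, here is a minimal Python sketch of rendering a predicted rRNA hit as a GFF3 line. This is not Barrnap's code (Barrnap itself is Perl); the `RrnaHit` fields, the `format_gff3` helper and the example hit are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RrnaHit:
    """A predicted rRNA feature; the field names are illustrative assumptions."""
    seqid: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    strand: str     # "+" or "-"
    evalue: float
    name: str       # e.g. "16S_rRNA"


def format_gff3(hit: RrnaHit, source: str = "example") -> str:
    """Render one hit as a tab-separated, nine-column GFF3 line."""
    columns = [
        hit.seqid,              # 1: sequence ID
        source,                 # 2: source (placeholder value)
        "rRNA",                 # 3: feature type
        str(hit.start),         # 4: start
        str(hit.end),           # 5: end
        f"{hit.evalue:.2g}",    # 6: score (here, the E-value)
        hit.strand,             # 7: strand
        ".",                    # 8: phase (not applicable to rRNA)
        f"Name={hit.name}",     # 9: attributes
    ]
    return "\t".join(columns)


# Made-up example hit, only to show the shape of the output.
hits = [RrnaHit("contig1", 1200, 2740, "+", 1e-50, "16S_rRNA")]
print("##gff-version 3")
for hit in hits:
    print(format_gff3(hit))
```

Using the E-value as the score column is a choice made for this sketch; the columns follow the GFF3 layout but not necessarily Barrnap's exact attribute set.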
BioJVM – Coordinating and integrating BioJava and ScaBio
Spencer Bliven, Andreas Prlic, Markus Gumbel
Both Java and Scala run on the Java Virtual Machine. As such, it makes sense to coordinate and document the various Bio* projects which run on the JVM and therefore can interoperate to some degree. We were able to successfully reference BioJava functions from Scala code and ScaBio functions from Java code. The ease of this process means that users can use both libraries from whichever language is better suited to their biological problem.
Peter Cock, Konstantin Tretyakov, Bin Zhang
The Biopython team worked on training new users at Codefest and exploring integration of Biopython with other Python molecular visualization toolkits like PyMol. Infrastructure development involved testing and debugging on multiple systems, including identifying and fixing Windows and PyPy problems. We also identified areas where we can make it easier to contribute to Biopython: specifically easing the process to report and fix bugs by moving to integrated GitHub issue tracking and working to support Biopython-associated projects with easy installation tools.
I spent several hours revisiting previous work on the Galaxy package for Bio-Linux and made significant progress towards it being something that can go into Debian-proper. Results will be committed to Deb-Med public SVN and patches will be forwarded to the Galaxy dev mailing list.
Standards and Visualization
Ontology and provenance representation
Herve Menager, Bertrand Neron, Jackie Quinn, Stian Soiland-Reyes, Matus Kalas, Steffen Moller
The goal of this group was to investigate and implement solutions to use ontologies to help people find and use the programs and data they need for their work, and to help automate the integration of tools or data resources into workflows or workbenches. We also wanted to identify useful provenance metadata, to store in a rigorous way the conditions and configuration of analysis steps run by users. This improves transparency, reproducibility, and reliability of the scientific results.
We worked toward inclusion of the EDAM ontology as part of the Mobyle system’s built-in type and classification mechanisms. We created a use case by identifying workflows in Mobyle and mapping their descriptions onto the EDAM classification, which allows mapping between the types. We also investigated the possibilities opened by projects such as PROV to standardize the provenance information stored by systems such as Mobyle. We added prototype functionality to the development version of Mobyle that dynamically generates this provenance information in a JSON-based format.
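As a rough idea of what JSON-based provenance for one analysis step could look like, here is a minimal Python sketch. It is not Mobyle's actual output: the layout borrows top-level keys from the W3C PROV-JSON serialization, while the `ex:` attribute names, the `provenance_record` function and all example values are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def provenance_record(job_id, program, version, parameters, inputs, outputs, started, ended):
    """Describe one analysis step as a PROV-style dictionary.

    The "ex:" attributes are hypothetical; only the general shape
    (activity, entity, used, wasGeneratedBy) follows PROV-JSON.
    """
    return {
        "activity": {
            job_id: {
                "prov:startTime": started.isoformat(),
                "prov:endTime": ended.isoformat(),
                "ex:program": program,
                "ex:version": version,
                "ex:parameters": parameters,
            }
        },
        "entity": {name: {} for name in inputs + outputs},
        "used": {
            f"_:u{i}": {"prov:activity": job_id, "prov:entity": name}
            for i, name in enumerate(inputs)
        },
        "wasGeneratedBy": {
            f"_:g{i}": {"prov:entity": name, "prov:activity": job_id}
            for i, name in enumerate(outputs)
        },
    }


# Placeholder job description, not taken from a real run.
record = provenance_record(
    job_id="ex:job-1",
    program="example-aligner",
    version="1.0",
    parameters={"gap_open": 10},
    inputs=["ex:reads.fasta"],
    outputs=["ex:alignment.txt"],
    started=datetime(2013, 7, 18, 9, 0, tzinfo=timezone.utc),
    ended=datetime(2013, 7, 18, 9, 5, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Recording the program version and parameters alongside the inputs and outputs is what lets someone rerun the step under the same conditions later.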
Integrate DGE-Vis & Dalliance, JS animation scheduler
David Powell, Thomas Down, Skyler Brungardt, Alex Kalderimis
Infrastructure management via CloudBioLinux (CBL)
Enis Afgan, John Chilton, Brad Chapman
- Galaxy: We integrated custom installation procedures present in CBL with the Galaxy-tools versioned installation methodology.
- Documentation: Due to increased interest from individuals in using and contributing to CBL, we invested effort into creating purpose-driven documentation for CBL. This should help people use the end product of CBL, customize CBL to their needs, and learn about the internals of CBL with the aim of contributing. We will finish and make the documentation available on ReadTheDocs over the coming months.
- Build frameworks: We developed a simpler automated method to invoke the CBL build framework to help remove complex, error-prone steps.
- Web tooling: In the spirit of making CBL more accessible and easier to use, we’ve decided to tackle development of a lightweight webapp that helps with customizing and generating CBL configuration files.
Improve ipython cluster support and runtime metrics
Valentine Svensson, Guillermo Carrasco, Roman Valls, Per Unneberg
We worked to extend the IPython parallel cluster framework to support additional schedulers, specifically implementing SLURM support to supplement the existing SGE, LSF, Torque and Condor schedulers. We plan to extend this to allow generalized use of the DRMAA connector, and ultimately to port that generalization into IPython so that scientific Python computations can run efficiently across different cluster implementations. Both Roman and Guillermo blogged detailed documentation of the work in progress.
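Supporting a new scheduler largely comes down to generating the right submission script for it. The following Python sketch renders a SLURM batch script that starts IPython engines; it is an illustration under stated assumptions, not the code written at Codefest. The `render_slurm_script` function, the template layout and the `general` partition name are hypothetical.

```python
# Template for a SLURM batch script; the #SBATCH options used here
# (--job-name, --ntasks, --time, --partition) are standard sbatch flags.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --ntasks={n_engines}
#SBATCH --time={walltime}
#SBATCH --partition={partition}
srun ipengine --profile={profile}
"""


def render_slurm_script(n_engines, profile="default", job_name="ipengine",
                        walltime="01:00:00", partition="general"):
    """Fill in the template for a given number of engines.

    The partition name is a placeholder; real clusters define their own.
    """
    if n_engines < 1:
        raise ValueError("n_engines must be at least 1")
    return SLURM_TEMPLATE.format(
        job_name=job_name,
        n_engines=n_engines,
        walltime=walltime,
        partition=partition,
        profile=profile,
    )


script = render_slurm_script(8)
print(script)
```

Keeping the scheduler-specific details in a template makes it straightforward to add another scheduler by supplying a different template with the same placeholders.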
We also worked to build a tool that helps provide run time estimates for bioinformatics jobs (e.g. “how long should aligning 40 million reads against hg19 with BWA take if I use 8 cores?”). We plan to collaborate on longer term development of this with the Genome Comparison and Analytic Testing team.
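One simple way to approach such an estimate is to scale a previously measured run. The Python sketch below is not the tool built at Codefest: it assumes runtime grows linearly with read count and follows Amdahl's law across cores, and the `estimate_runtime` function, the `parallel_fraction` default and the reference measurement are all placeholders.

```python
def estimate_runtime(reads, cores, ref_reads, ref_cores, ref_seconds,
                     parallel_fraction=0.9):
    """Scale a measured reference run to a new job size and core count.

    Assumes runtime is linear in the number of reads and follows
    Amdahl's law across cores; both are simplifying assumptions.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")

    def amdahl(n):
        # Relative runtime on n cores compared to a single core.
        return (1.0 - parallel_fraction) + parallel_fraction / n

    single_core_seconds = ref_seconds / amdahl(ref_cores)
    return single_core_seconds * amdahl(cores) * (reads / ref_reads)


# Placeholder reference measurement, not a real benchmark:
# 10 million reads took 3600 seconds on 4 cores.
estimate = estimate_runtime(reads=40_000_000, cores=8,
                            ref_reads=10_000_000, ref_cores=4,
                            ref_seconds=3600.0)
print(f"{estimate / 3600:.1f} hours")
```

A real estimator would need measured benchmarks per tool, genome and machine type, which is why collecting shared runtime metrics is valuable.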
GATK-based reusable pipeline based around Rubra/Ruffus
Clare Sloggett, Bernie Pope
We worked on code cleanup, documentation and test data for a reusable pipeline to handle variant calling and annotation, using Rubra built on the Ruffus framework. It handles BWA alignment, GATK alignment cleaning and variant calling, and Ensembl annotation. To make these pipelines easier to run, we worked on integrating them into the GVF flavor in CloudBioLinux.
On April 7th and 8th, a group of biologists and programmers gathered at the Broad Institute to work on improving interoperability of open-source bioinformatics tools. Organized by the Open Bioinformatics Foundation and GenomeSpace team, this was part of the lead up to the Bioinformatics Open Source Conference (BOSC) in July in Berlin. The event is part of an ongoing series of coding sessions (Codefests or Hackathons) organized by the open bioinformatics community, which give programmers who typically work together remotely a chance to code and discuss in the same place for two days. These have been successful in both producing new code and in building connections which help sustain development of these community projects.
Goals and outcomes
One major challenge in analyzing biological data is interfacing multiple bioinformatics tools. Tools often work independently, and where general architectures like plugins or APIs exist they are often project specific. This results in isolated islands of data exchange: transferring data or resources between them requires work that is often rate-limiting or insurmountable.
Our goal at the hackathon was to provide simple APIs and implementations that help facilitate transfers between multiple islands of functionality. GenomeSpace does this by providing a central hub and API to push and pull from tools. We wanted to generalize this to support multiple tools, and build client implementations that demonstrate this in practice. The long term goal is to encourage tool developers to provide server side APIs compatible with the more general library, making extension of the connector toolkit easier. For developers, the client API would allow them to easily transfer files between multiple tools without needing to learn and implement the specific transfer APIs of each tool.
We called this high level client library Genome Connector (gcon, for short) and took a practical approach by implementing client libraries that provide a common interface to multiple tools: GenomeSpace, Galaxy, BaseSpace, 23andMe and general key-value stores through jClouds. To identify a reasonable amount of work for two days, we focused on file transfer: authentication, finding files, getting and putting files to remote analysis platforms. In addition we defined some critical components for doing biological work:
- File metadata: We need to be able to store arbitrary key/value pairs on objects to assign the biological information necessary to interpret them, like organism and genome build. In addition, metadata allows provenance and tracking of files by enabling annotation of files with history and processing steps.
- Filesets: Large biological files have secondary files with indexes, allowing indexed retrieval of data (for example: read bam and bai, variant vcf and idx, tabix gz and tbi). To avoid expensive reindexing, we want to group and transfer these together.
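The fileset idea above can be sketched in a few lines of Python. This is a minimal illustration, not gcon code: it assumes an index is named by appending its suffix to the primary file name (for example `sample.bam.bai`), and `group_filesets` and `INDEX_SUFFIXES` are hypothetical names.

```python
# Index suffixes mentioned in the text: bam/bai, vcf/idx, tabix gz/tbi.
INDEX_SUFFIXES = (".bai", ".idx", ".tbi")


def group_filesets(filenames):
    """Map each primary file to the index files that should travel with it.

    Assumes an index is named <primary><suffix>; an index without a
    matching primary file is kept as its own entry.
    """
    names = set(filenames)
    filesets = {}
    for name in sorted(names):
        primary = name[: name.rfind(".")] if "." in name else name
        if name.endswith(INDEX_SUFFIXES) and primary in names:
            continue  # already attached to its primary file
        filesets[name] = [name + suffix for suffix in INDEX_SUFFIXES
                          if name + suffix in names]
    return filesets


files = ["reads.bam", "reads.bam.bai", "calls.vcf.gz", "calls.vcf.gz.tbi", "notes.txt"]
for primary, indexes in group_filesets(files).items():
    print(primary, indexes)
```

Transferring each group as a unit avoids re-running the indexing step on the receiving side.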
We also identified other useful extensions that would help improve interoperability and facilitate building connected tools, like providing publish/subscribe hooks to avoid having to poll servers for updates, and smarter approaches to sending data to avoid duplication and unnecessary transfers.
The outputs of our discussion and coding are common Genome Connector implementations in multiple languages. GitHub repositories are available for in-progress Java, Python and Clojure implementations. These wrap multiple diverse tools and expose them through a common top-level API, allowing developers to push and pull data from multiple tools.
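To show what a common top-level API might look like, here is a minimal Python sketch covering the four operations discussed: authentication, finding files, getting and putting. It does not reproduce the actual Genome Connector API; the `Connector` interface, the `InMemoryConnector` stand-in and the `transfer` helper are hypothetical, and a real implementation would talk to a remote service rather than a dictionary.

```python
import fnmatch
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical common interface; method names are illustrative."""

    @abstractmethod
    def authenticate(self, username, password): ...

    @abstractmethod
    def find(self, pattern): ...

    @abstractmethod
    def get(self, name): ...

    @abstractmethod
    def put(self, name, data, metadata=None): ...


class InMemoryConnector(Connector):
    """Stand-in for a remote platform, storing files in a dictionary."""

    def __init__(self):
        self._files = {}
        self._authenticated = False

    def authenticate(self, username, password):
        # A real client would exchange credentials with the service here.
        self._authenticated = True

    def _require_auth(self):
        if not self._authenticated:
            raise PermissionError("call authenticate() first")

    def find(self, pattern):
        self._require_auth()
        return sorted(n for n in self._files if fnmatch.fnmatch(n, pattern))

    def get(self, name):
        self._require_auth()
        data, metadata = self._files[name]
        return data, dict(metadata)

    def put(self, name, data, metadata=None):
        self._require_auth()
        self._files[name] = (data, dict(metadata or {}))


def transfer(source, target, pattern):
    """Copy matching files, with their metadata, between two connectors."""
    copied = source.find(pattern)
    for name in copied:
        data, metadata = source.get(name)
        target.put(name, data, metadata)
    return copied


hub, tool = InMemoryConnector(), InMemoryConnector()
hub.authenticate("user", "secret")
tool.authenticate("user", "secret")
hub.put("sample1.vcf", b"##fileformat=VCFv4.1\n", {"organism": "human", "build": "hg19"})
print(transfer(hub, tool, "*.vcf"))
```

Because `transfer` only depends on the shared interface, the same call works for any pair of tools that implement it, which is the point of having a common client library.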
I’m immensely grateful to the incredible participants who generously donated their time and expertise to help with these projects. For anyone interested we also have detailed documentation on discussions during the hackathon.
Bioinformatics Open Source Conference
If you’re a bioinformatics programmer interested in open source coding and helping answer biological questions by improving usability and connectivity of tools, you’re welcome to join the OpenBio and BOSC communities. We’ve created a biological interoperability mailing list for additional discussion. The next BOSC conference is July 19th and 20th in Berlin, Germany as part of the ISMB conference. There will also be another two-day Codefest preceding BOSC on July 17th and 18th. Abstracts for talks at BOSC are due this Friday, April 12th. Looking forward to seeing everyone at future BOSC and coding events.
My last post described a distributed exome analysis pipeline implemented on the CloudBioLinux and CloudMan frameworks. This was a practical introduction to running the pipeline on Amazon resources. Here I’ll describe how the pipeline runs in parallel, specifically diagramming the workflow to identify points of parallelization during lane and sample processing.
Incredible innovation in throughput makes parallel processing critical for next-generation sequencing analysis. When a single HiSeq run can produce 192 samples (2 flowcells x 8 lanes per flowcell x 12 barcodes per lane), the analysis steps quickly become limited by the number of processing cores available.
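The arithmetic behind that number, and why the core count becomes the bottleneck, can be written out in a few lines of Python. The `rounds_needed` helper is a hypothetical illustration that assumes every sample takes about the same time and uses a fixed number of cores.

```python
import math

flowcells, lanes_per_flowcell, barcodes_per_lane = 2, 8, 12
samples = flowcells * lanes_per_flowcell * barcodes_per_lane


def rounds_needed(n_samples, cores, cores_per_sample=1):
    """Number of sequential rounds needed to process all samples.

    Simplifying assumption: each sample occupies a fixed number of
    cores for roughly the same amount of time.
    """
    concurrent = max(1, cores // cores_per_sample)
    return math.ceil(n_samples / concurrent)


print(samples)                      # 192
print(rounds_needed(samples, 16))   # 12
```

With 16 cores and one core per sample, the run needs 12 rounds of processing; doubling the cores halves the number of rounds.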
The heterogeneity of architectures used by researchers is a major challenge in building re-usable systems. A pipeline needs to support powerful multi-core servers, clusters and virtual cloud-based machines. The approach we took is to scale at the level of individual samples, lanes and pipelines, exploiting the embarrassingly parallel nature of the computation. An AMQP messaging queue allows for communication between processes, independent of the system architecture. This flexible approach allows the pipeline to serve as a general framework that can be easily adjusted or expanded to incorporate new algorithms and analysis methods.
Process overview — points for parallel implementations
The first level of parallelization occurs during processing of each fastq lane. We split the file into individualized barcoded components, followed by alignment and BAM processing. The result is a sorted BAM file for each barcoded sub-sample, given a set of input fastq files.
The pipeline merges samples present in barcodes on multiple lanes, producing a single representative BAM file. The next step parallelizes the processing of each alignment file with read quality assessment, preparation for visualization and variant calling.
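As a rough illustration of this merge-then-parallelize step, the Python sketch below groups per-lane BAM files by sample name and then handles each sample independently. It is not the pipeline's code: `group_by_sample` and `process_sample` are hypothetical names, the file names are made up, and the processing function is a placeholder for the real merge, quality assessment and variant calling.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_sample(lane_bams):
    """Collect per-lane BAM paths under their sample name.

    lane_bams is an iterable of (sample_name, bam_path) pairs.
    """
    grouped = defaultdict(list)
    for sample, bam in lane_bams:
        grouped[sample].append(bam)
    return dict(grouped)


def process_sample(item):
    """Placeholder for merging lanes and running downstream analysis."""
    sample, bams = item
    return sample, f"{sample}-merged.bam", len(bams)


# Made-up inputs: the "tumor" sample was sequenced on two lanes.
lane_bams = [
    ("tumor", "1_tumor.bam"),
    ("normal", "1_normal.bam"),
    ("tumor", "2_tumor.bam"),
]
grouped = group_by_sample(lane_bams)

# Samples are independent of each other, so they can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_sample, sorted(grouped.items())))

for sample, merged, n_lanes in results:
    print(sample, merged, n_lanes)
```

A thread pool is used here only to keep the sketch self-contained; the point is that each sample is an independent unit of work that can be handed to any available worker.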
The variant calling steps use the Genome Analysis Toolkit (GATK) from the Broad Institute. It prepares alignments by recalibrating initial quality scores given the aligned sequences and consistently realigning reads around indels. The Unified Genotyper identifies variants from this prepared alignment file, then uses these variants along with known true sites for assigning quality scores and filtering to a final set of calls.
Messaging approach to parallel execution
The process diagrams illustrate points of parallel execution for each fastq file and sample analysis. Practically, a top level analysis server manages each of the sub-processes. A command line script, a LIMS system or a specialized Galaxy interface starts this top level process. RabbitMQ messaging facilitates communication between the analysis controller and processing nodes.
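The controller and worker pattern can be sketched with Python's standard library. This is a stand-in, not the pipeline's implementation: in-process `queue.Queue` objects take the place of RabbitMQ, threads take the place of processing nodes, and the `worker` and `run_controller` names and the message fields are assumptions made for illustration.

```python
import json
import queue
import threading


def worker(jobs, results):
    """Processing node: consume job messages until told to stop."""
    while True:
        message = jobs.get()
        if message is None:   # sentinel: no more work
            break
        job = json.loads(message)
        # Placeholder for running the requested analysis step.
        reply = {"sample": job["sample"], "step": job["step"], "status": "done"}
        results.put(json.dumps(reply))


def run_controller(samples, n_workers=2):
    """Analysis controller: send one message per sample, collect replies."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for sample in samples:
        jobs.put(json.dumps({"sample": sample, "step": "align"}))
    for _ in threads:
        jobs.put(None)        # one stop signal per worker
    for thread in threads:
        thread.join()
    finished = []
    while not results.empty():
        finished.append(json.loads(results.get()))
    return sorted(finished, key=lambda reply: reply["sample"])


print(run_controller(["s1", "s2", "s3"]))
```

Because the controller and workers only exchange serialized messages, the workers could equally run on other cores, cluster nodes or cloud machines once the in-process queue is replaced by a message broker.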
In my previous post, CloudMan manages this entire process. The web interface controls a pre-configured SGE cluster and a custom script starts the job on this cluster. However, the general nature of the pipeline architecture allows this to work equally well on multiple core machines or a heterogeneous set of connected machines.
The CloudMan work demonstrates that clusters, especially on-demand virtual images like those available from Amazon, are a powerful way to scale analyses. Equally important, it provides an open platform to share these pipelines and encourage re-use. The code for the pipeline is available from the bcbio-nextgen GitHub repository.
Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses.
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.
Researcher sample entry
Biologists use a local Galaxy server as an entry point to submit samples for sequencing. This provides a familiar interface and central location for both entering sample information and retrieving and analyzing the sequencing data.
Practically, a user begins by browsing to the sample submission page. There they are presented with a wizard interface which guides them through entry of sample details. Multiplexed samples are supported through a drag and drop interface.
When all samples are entered, the user submits them as a sequencing project. This includes billing information and a project name to facilitate communication between the researcher and the core group about submissions. Users are able to view their submissions grouped as projects and track the state of constructs. Since we support a number of services in addition to sequencing — like library construction, quantitation and validation — this is a valuable way for users to track and organize their requests.
Sequencing tracking and management
Administrators and sequencing technicians have access to additional functionality to help manage the internal sample preparation and sequencing workflow. The main sample tracking interface centers around a set of queues; each queue represents a state that a sample can be in. Samples move through the queues as they are processed, with additional information being added to the sample at each step. For instance, a sample in the ‘Pre-sequencing quantitation’ queue moves to the ‘Sequencing’ queue once it has been fully quantitated, with that quantitation information entered by the sequencing technician during the transition.
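The queue-based tracking described above behaves like a small state machine. The Python sketch below illustrates the idea and is not the Galaxy implementation: the `Sample` class and `advance` method are hypothetical, only the first two queue names come from the text, and the final "Complete" queue and the quantitation value are made up.

```python
# The first two queue names come from the text; "Complete" is hypothetical.
QUEUES = ["Pre-sequencing quantitation", "Sequencing", "Complete"]


class Sample:
    """A sample that moves through an ordered set of queues."""

    def __init__(self, name):
        self.name = name
        self.queue = QUEUES[0]
        self.details = {}   # information entered at each transition

    def advance(self, **details):
        """Record details for the current queue, then move to the next."""
        position = QUEUES.index(self.queue)
        if position == len(QUEUES) - 1:
            raise ValueError(f"{self.name} is already in the final queue")
        self.details[self.queue] = details
        self.queue = QUEUES[position + 1]


sample = Sample("library-1")
# The technician enters quantitation results during the transition.
sample.advance(concentration_nM=12.5)
print(sample.queue)
print(sample.details)
```

Storing the entered details under the queue they belong to keeps a record of what was known about the sample at each step.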
Assigning samples to flow cells occurs using a drag and drop jQueryUI interface. The design is flexible to allow for placing samples across multiple lanes or multiplexing multiple barcoded samples into a single lane.
Viewing sequencing results
Running a sequencing machine requires careful monitoring of results and our interface provides several ways to view this data. Raw cluster and read counts are linked to a list of runs. For higher level analyses, interactive plots are available for viewing reads over time and pass rates compared to read density. These allow adjustment of experimental procedures to maximize useful reads based on current machine chemistry.
Utilizing a front end that organizes requests allows sequencing results to be processed through a fully automated analysis pipeline on the back end. The pipeline detects runs coming off of a sequencer, transfers files to storage and analysis machines and manages a number of processing steps:
- Alignment with bowtie or bwa.
- Generation of alignment and read statistics with Picard, the fastx toolkit and SolexaQA.
- Preparation of a summary PDF with detailed statistics about the run and alignment.
In addition to the default analysis, a full SNP calling pipeline is included with:
- GATK recalibration and realignment
- SNP identification with GATK’s unified genotyper
- Variant effect prediction with snpEff.
Fastq reads, alignment files, summary PDFs and other associated files are uploaded back into Galaxy Data Libraries organized by sample names. Users can download results for offline work, or import them directly into their Galaxy history for further analysis or display.
Installing and extending
The code base is maintained as a Bitbucket repository that tracks the main galaxy-central distribution. It is updated from the main site regularly to maintain compatibility, with the future goal of integrating a generalized version into the main source tree. Detailed installation instructions are available for setting up the front-end client.
The analysis pipeline is written in Python and drives a number of open source programs; it is available as a GitHub repository with documentation and installation instructions.
We are using the current system in production and continue to develop and add features based on user feedback. We would like to generalize this for other research cores with additional instruments and services, and would be happy to hear from developers working on this type of system for their facilities.
This work would not have been possible without the great open source toolkits and frameworks that it builds on. Galaxy provides not only an analysis framework, but also a ready to use database structure for managing samples and requests. The front end builds off existing Galaxy sample tracking work, and requires only two new database storage tables.
The main change from the existing sample tracking framework is a generalization of the sample and request relationships. Requests can both contain samples, and be a part of samples so that a sequenced sample is organized as:
By reusing and expanding the great work of the Galaxy team, we hope to eventually integrate useful parts of this work into the Galaxy codebase.
My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with:
- A permanent web site at cloudbiolinux.org
- Additional software and genomic data
- New user documentation
- A community coding session: Codefest 2010
New software and data
The most exciting changes have been the rapid expansion of installed software and libraries. The goal is to provide an image that experienced developers will find as useful as their custom configured servers. A great group of contributors have put together a large set of programs and libraries; the configuration files have all the details on installed programs as well as libraries for Python, Perl, Ruby, and R. Another addition is support for non-packaged programs, which provides software not yet neatly wrapped in a package manager or library-specific install system: next-gen software packages like Picard, GATK and Bowtie are installed through custom scripts.
To improve accessibility for developers who prefer a desktop experience, a FreeNX server was integrated with the provided images. Tim Booth from the NEBC Bio-Linux team headed up the integration of FreeNX, and the user experience looks very similar to a locally installed Bio-Linux desktop.
In addition to the software image, a public data volume is now available that contains:
- Genome sequences pre-indexed for search with next-gen aligners like Bowtie, Novoalign, and BWA.
- LiftOver files for mapping between sequence coordinates.
- UniRef protein databases, indexed for searching with BLAST+.
Coupled with the software images, this volume makes it easy to do next-gen analyses. Start up an Amazon AMI, attach the genome data volume, transfer your fastq file to the instance, and kick off the analysis. The overhead of software installation and genome indexing is completely removed. Thanks to the work of Enis Afgan and James Taylor of Galaxy, the data volume plugs directly into Galaxy’s ready to use cloud image. Coupling the data and software with Galaxy provides a familiar web interface for running tools and developing biological workflows.
The data volume preparation is fully automated via a fabric install script, similar to the software install script. Additional data sources are easily integrated, and we hope to expand the available datasets based on feedback from the community.
Documentation and presentations
The software and data volumes are only as good as the documentation which helps people use them:
- Bela Tiwari of the NEBC Bio-Linux team has written an excellent introduction to Amazon EC2 and CloudBioLinux. This breaks down the process of signing up for an account, creating a software image, associating data volumes and setting up a graphical server. It’s a great place to get started with CloudBioLinux.
- Ntino Krampis, from the JCVI Cloud Bio-Linux project, gave a presentation on CloudBioLinux explaining the motivation behind the project and providing usage examples.
- My presentation on the open source community behind CloudBioLinux from Amazon’s Genomic Data workshop. This details the project goals and automated code organization.
Community: Codefest 2010
The CloudBioLinux community had a chance to work together for two days in July at Codefest 2010. In conjunction with the Bioinformatics Open Source Conference (BOSC) in Boston, this was a free to attend coding session hosted at Harvard School of Public Health and Massachusetts General Hospital. Over 30 developers donated two days of their time to working on CloudBioLinux and other bioinformatics open source projects.
Many of the advances in CloudBioLinux detailed above were made possible through this session: the FreeNX graphical client integration, documentation, Galaxy interoperability, and many library and data improvements were started during the two days of coding and discussions. Additionally, the relationships developed are the foundation for better communication amongst open source projects, which is something we need to be continually striving for in the scientific computing world.
It was amazing and inspiring to get such positive feedback from so many members of the bioinformatics community. We’re planning another session next year in Vienna, again just before BOSC and ISMB 2011; and again, everyone is welcome.
Go to the CloudBioLinux website for the latest publicly available images and data volumes, which are ready to use on Amazon EC2. With Amazon’s new micro-images you can start analyzing data for only a few cents an hour. It’s an easy way to explore if cloud resources will help with computational demands in your work. We’re very interested in feedback and happy to have other developers helping out; please get in touch on the CloudBioLinux mailing list.
Amazon Web Services provides scalable, on-demand computational resources through its Elastic Compute Cloud (EC2). Previously, I described the goal of providing publicly available machine images loaded with bioinformatics tools. I’m happy to describe an initial step in that direction: an automated build system, using easily editable configuration files, that generates a bioinformatics-focused Amazon Machine Image (AMI) containing packages integrated from several existing efforts. The hope is to consolidate the community’s open source work around a single, continuously improving, machine image.
This image incorporates software from several existing AMIs:
- JCVI Cloud BioLinux — JCVI’s work porting Bio-Linux to the cloud.
- bioperl-max — Fortinbras’ package of BioPerl and associated informatics tools.
- MachetEC2 — An InfoChimps image loaded with data mining software.
Each of these libraries inspired different aspects of developing this image and associated infrastructure, and I’m extremely grateful to the authors for their code, documentation and discussions.
The current AMI is available for loading on EC2 — search for ‘CloudBioLinux’ in the AWS console or go to the CloudBioLinux project page for the latest AMIs. Automated scripts and configuration files with contained packages are available as a GitHub repository.
This image is intended as a starting point for developing a community resource that provides biology and data-mining oriented software. Experienced developers should be able to fire up this image and expect to find the same up to date libraries and programs they have installed on their work machines. If their favorite package is missing it should be quick and easy to add, making the improvement available to future developers.
Achieving these goals requires help and contributions from other programmers utilizing the cloud — everyone reading this. The current image is ready to be used, but is more complete in areas where I normally work. For instance, the Python and R libraries are off to a good start. I’d like to extend an invitation to folks with expertise in other areas to help improve the coverage of this AMI:
- Programmers: help expand the configuration files for your areas of interest:
  - Perl CPAN support and libraries
  - Ruby gems
  - Java libraries
  - Haskell hackage support and libraries
  - Erlang libraries
  - Bioinformatics areas of specialization:
    - Next-gen sequencing
    - Structural biology
    - Parallelized algorithms
  - Much more… Let us know what you are interested in.
- Documentation experts: provide cookbook style instructions to help others get started.
- Porting specialists: The automation infrastructure is dependent on having good ports for libraries and programs. Many widely used biological programs are not yet ported. Establishing a Debian or Ubuntu port for a missing program will not only help this effort, but make the programs more widely available.
- Systems administrators: The ultimate goal is to have the AMI be automatically updated on a regular basis with the latest changes. We’d like to set up an Amazon instance that pulls down the latest configuration, populates an image, builds the AMI, and then updates a central web page and REST API for getting the latest and greatest.
- Testers: Check that this runs on open source Eucalyptus clouds, additional Linux distributions, and other cloud deployments.
If any of this sounds interesting, please get in contact. The Cloud BioLinux mailing list is a good central point for discussion.
In addition to supplying an image for downstream use, this implementation was designed to be easily extendible. Inspired by the MachetEC2 project, packages to be installed are entered into a set of easy to edit configuration files in YAML syntax. There are three different configuration file types:
- main.yaml — The high level configuration file defining which groups of packages to install. This allows a user to build a custom image simply by commenting out those groups which are not of interest.
- packages.yaml — Defines debian/ubuntu packages to be installed. This leans heavily on the work of DebianMed and Bio-Linux communities, as well as all of the hard working package maintainers for the distributions. If it exists in package form, you can list it here.
- python-libs.yaml, r-libs.yaml — These take advantage of language specific ways of installing libraries. Currently implemented is support for Python library installation from the Python package index, and R library installation from CRAN and Bioconductor. This will be expanded to include support for other languages.
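The way the configuration files above fit together can be sketched in Python. This is an illustration rather than the project's fabfile: plain dictionaries stand in for the parsed YAML files, and the key names, group names, package names and the `packages_to_install` helper are assumptions made for this example.

```python
def packages_to_install(main_config, package_groups):
    """Expand the groups enabled in the main config into a package list.

    Groups left out of the main config (for example, commented out in
    the YAML file) are skipped; duplicates are installed only once.
    """
    selected = []
    for group in main_config["packages"]:
        for package in package_groups.get(group, []):
            if package not in selected:
                selected.append(package)
    return selected


# Stand-in for main.yaml: the "ruby" group has been commented out.
main_config = {"packages": ["python", "bio_nextgen"]}

# Stand-in for packages.yaml: package names grouped by area.
package_groups = {
    "python": ["python-numpy", "python-scipy"],
    "ruby": ["ruby"],
    "bio_nextgen": ["bwa", "bowtie"],
}

print(packages_to_install(main_config, package_groups))
```

Keeping the choice of groups separate from the package lists is what lets a user build a custom image by editing a single high level file.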
We hope that the straightforward architecture of the build system will encourage other developers to dig in and provide additional coverage of programs and libraries through the configuration files. For those comfortable with Python, the fabfile is very accessible for adding in new functionality.
If you are interested in face-to-face collaboration and will be in the Boston area on July 7th and 8th, check out Codefest 2010; it’ll be two enjoyable days of cloud informatics development. I’m looking forward to hearing from other developers who are interested in building and maintaining an easy to use, up to date, machine image that can help make biological computation more accessible to the community.
Google Summer of Code provides the unique opportunity for students to spend a summer working on open source projects and getting paid. Biopython was involved with two great projects last summer, and it’s time to apply for this year’s program: the student application period is from next Monday, March 29th to Friday, April 9th, 2010.
If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:
- NESCent Phyloinformatics — NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.
- Open Bioinformatics Foundation — The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.
This year, I’ve collaborated on three project ideas centering around the idea of tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts out the implementation details, instead allowing you to focus on the high level biological questions. Bradford Cross, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in an interview at Data Wrangling.
These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:
- Biopython and PyCogent interoperability
- Phylogenetics pipeline development in Galaxy
- Building Python APIs for R phylogenetic toolkits
All involve taking two or more different toolkits and combining the functionality into a higher level interface focused around ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on functionality most relevant to their biological questions. Ideally the work would be both a publicly available resource, and contribute directly to the student’s daily research.
If you’re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It’s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.