Posts Tagged ‘gsoc’
Google Summer of Code gives students a unique opportunity to spend a summer working on open source projects while getting paid. Biopython was involved with two great projects last summer, and it’s time to apply for this year’s program: the student application period runs from next Monday, March 29th to Friday, April 9th, 2010.
If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:
- NESCent Phyloinformatics — NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.
- Open Bioinformatics Foundation — The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.
This year, I’ve collaborated on three project ideas centered on tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts away the implementation details, allowing you to focus on the high-level biological questions. Bradford Cross, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in an interview at Data Wrangling.
These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:
- Biopython and PyCogent interoperability
- Phylogenetics pipeline development in Galaxy
- Building Python APIs for R phylogenetic toolkits
All involve taking two or more different toolkits and combining their functionality into a higher-level interface focused on ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on the functionality most relevant to their biological questions. Ideally the work would both be a publicly available resource and contribute directly to the student’s daily research.
If you’re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It’s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.
Google Summer of Code is a brilliant program that provides a summer stipend for students to work on open source projects. This year, I have been evaluating Biopython student proposals under the watchful eye of the National Evolutionary Synthesis Center (NESCent). Biopython was lucky enough to have interest from several students, resulting in some excellent proposals. Two of these were selected for funding:
- Eric Talevich’s proposal on adding support to Biopython for PhyloXML, an XML format for representing phylogenetic trees and associated data. I will be the primary mentor for this project, with Christian Zmasek, the author of PhyloXML, providing plenty of capable secondary mentoring.
- Nick Matzke’s proposal on Biogeographical Phylogenetics. This is an integrated project that involves extracting biodiversity data from the web, merging it with phylogenetic data, and providing analysis and display libraries. Nick helped assemble a full team of mentors, and I will be providing input on integrating his code with Biopython while learning about Biogeography along the way.
This is my first year working with Summer of Code, and I could not be more impressed by the professionalism and hard work of the organizers at both Google and NESCent. The program is designed from the ground up to be simultaneously inclusive and rigorously selective. Students are asked to prepare detailed project plans to demonstrate that they understand the subject, are able programmers, and can communicate effectively. Reviewers also look closely for students who are likely to continue on in their open source communities after the summer; after all, open source work is really built on the free labor of many dedicated people.
Having worked on open source projects for several years, I find it invigorating and heartening to see a program designed to encourage the next generation of open source leaders. My time on the program thus far has been a great learning experience, and I am looking forward to a productive summer.
The BioSQL project provides a well-thought-out relational database schema for storing biological sequences and annotations. For developers who are responsible for setting up local stores of biological data, BioSQL provides a huge advantage through reusability. Some of the best features of BioSQL, in my experience, are:
- Available interfaces for several languages (via Biopython, BioPerl, BioJava and BioRuby).
- Flexible storage of data via a key/value pair model. This models information in an extensible manner, and helps with understanding distributed key/value stores like SimpleDB and CouchDB.
- Overall data model based on GenBank flat files. This makes teaching the model to biology-oriented users much easier; you can pull up a text file from NCBI with a sequence and directly show how items map to the database.
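To make the key/value idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The three tables are a simplified illustration loosely modeled on BioSQL's approach, not the real schema, which has more tables and columns. The helper functions, the accession "X00001" and the annotation values are all made up for the example.

```python
import sqlite3

# Simplified, hypothetical tables illustrating key/value annotation storage.
# The real BioSQL schema is considerably richer than this.
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    term_id     INTEGER NOT NULL REFERENCES term(term_id),
    value       TEXT NOT NULL,
    rank        INTEGER NOT NULL DEFAULT 0,
    UNIQUE (bioentry_id, term_id, rank)
);
"""


def add_annotation(conn, accession, key, value):
    """Attach one key/value annotation to an entry, creating the key if needed."""
    conn.execute("INSERT OR IGNORE INTO term (name) VALUES (?)", (key,))
    term_id = conn.execute(
        "SELECT term_id FROM term WHERE name = ?", (key,)
    ).fetchone()[0]
    bioentry_id = conn.execute(
        "SELECT bioentry_id FROM bioentry WHERE accession = ?", (accession,)
    ).fetchone()[0]
    # The rank keeps repeated values for the same key in insertion order.
    rank = conn.execute(
        "SELECT COUNT(*) FROM bioentry_qualifier_value "
        "WHERE bioentry_id = ? AND term_id = ?",
        (bioentry_id, term_id),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO bioentry_qualifier_value (bioentry_id, term_id, value, rank) "
        "VALUES (?, ?, ?, ?)",
        (bioentry_id, term_id, value, rank),
    )


def get_annotations(conn, accession):
    """Return a {key: [values, ...]} dictionary for one entry."""
    rows = conn.execute(
        """
        SELECT t.name, q.value
        FROM bioentry_qualifier_value AS q
        JOIN term AS t ON t.term_id = q.term_id
        JOIN bioentry AS b ON b.bioentry_id = q.bioentry_id
        WHERE b.accession = ?
        ORDER BY t.name, q.rank
        """,
        (accession,),
    )
    result = {}
    for name, value in rows:
        result.setdefault(name, []).append(value)
    return result


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO bioentry (accession, description) VALUES (?, ?)",
    ("X00001", "example record"),  # made-up accession
)
add_annotation(conn, "X00001", "organism", "Escherichia coli")
add_annotation(conn, "X00001", "keyword", "example")
add_annotation(conn, "X00001", "keyword", "demo")
annotations = get_annotations(conn, "X00001")
```

The point of the design is that a new kind of annotation only needs a new row in the term table rather than a schema change, which is what makes the model extensible.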
Given the usefulness of BioSQL for local relational data storage, I would like to see it move into the rapidly expanding cloud development community. Tying BioSQL data storage in with web frameworks will help researchers make their data publicly available earlier and in standard formats. As a nice recent example, George has a series of posts on using BioSQL with Ruby on Rails. There have also been several discussions on the BioSQL mailing list around standard web tools and APIs to sit on top of the database; see this thread for a recent example.
Towards these goals, I have been working on a BioSQL-backed interface for Google App Engine. Google App Engine is a Python-based framework for quickly developing and deploying web applications. For data storage, Google’s Datastore provides an object interface to a distributed, scalable storage backend. Practically, App Engine has free hosting quotas which can scale to larger instances as demand for the data in the application increases; this will appeal to cost-conscious researchers by removing an initial barrier to making their data available.
My work on this was accelerated by the Open Bioinformatics Foundation’s move to apply for participation in Google’s Summer of Code. OpenBio is a great community that helps organize projects like BioPerl, Biopython, BioJava and BioRuby. After writing up a project idea for BioSQL on Google App Engine in our application, I was inspired to finish a demonstration of the idea.
I am happy to announce a simple demonstration server running a BioSQL-based backend: BioSQL Web. The source code is available from my git repository. Currently the server allows uploads of GenBank-formatted files and provides a simple view of the records, annotations and sequences. The data is stored in the Google Datastore with an object interface that mimics the BioSQL relational model.
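As a rough illustration of what an object interface that mimics a relational model can look like, here is a small sketch in plain Python. It does not use the App Engine Datastore API and is not the code running on the demonstration server; the class names, fields and the in-memory store are hypothetical stand-ins chosen to echo BioSQL concepts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Biosequence:
    """Sequence data, kept apart from the entry as a separate table would be."""
    seq: str
    alphabet: str = "dna"


@dataclass
class Bioentry:
    """One record, with key/value qualifiers held as a dictionary of lists."""
    accession: str
    description: str
    biosequence: Biosequence
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)

    def add_qualifier(self, key: str, value: str) -> None:
        # Repeated keys accumulate values in insertion order.
        self.qualifiers.setdefault(key, []).append(value)


class InMemoryStore:
    """Hypothetical stand-in for a datastore: entities keyed by accession."""

    def __init__(self) -> None:
        self._entries: Dict[str, Bioentry] = {}

    def put(self, entry: Bioentry) -> None:
        self._entries[entry.accession] = entry

    def get(self, accession: str) -> Optional[Bioentry]:
        return self._entries.get(accession)


store = InMemoryStore()
entry = Bioentry(
    accession="X00001",  # made-up accession
    description="example record",
    biosequence=Biosequence(seq="ATGGCC"),
)
entry.add_qualifier("organism", "Escherichia coli")
store.put(entry)
fetched = store.get("X00001")
```

Here the foreign-key relationships of a relational schema become object references, so a view can walk from a record to its sequence and annotations without writing joins.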
Future posts will provide more details on the internals of the server end and client interface as they develop. As always, feedback, thoughts and code contributions are very welcome.