Next generation sequencing information management and analysis system for Galaxy
Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses.
Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy.
This post will show videos of the interface in action, describe installation and extension of the system, and detail the implementation architecture.
Front-end usage
Researcher sample entry
Biologists use a local Galaxy server as an entry point to submit samples for sequencing. This provides a familiar interface and central location for both entering sample information and retrieving and analyzing the sequencing data.
Practically, a user begins by browsing to the sample submission page. There they are presented with a wizard interface which guides them through entry of sample details. Multiplexed samples are supported through a drag and drop interface.
When all samples are entered, the user submits them as a sequencing project. This includes billing information and a project name to facilitate communication between the researcher and the core group about submissions. Users are able to view their submissions grouped as projects and track the state of constructs. Since we support a number of services in addition to sequencing — like library construction, quantitation and validation — this is a valuable way for users to track and organize their requests.
Sequencing tracking and management
Administrators and sequencing technicians have access to additional functionality to help manage the internal sample preparation and sequencing workflow. The main sample tracking interface centers around a set of queues; each queue represents a state that a sample can be in. Samples move through the queues as they are processed, with additional information being added to the sample at each step. For instance, a sample in the ‘Pre-sequencing quantitation’ queue moves to the ‘Sequencing’ queue once it has been fully quantitated, with that quantitation information entered by the sequencing technician during the transition.
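As a sketch, the queue model amounts to a simple ordered state machine, with step-specific details recorded at each transition. The state names and data fields below are illustrative assumptions based on the description above, not Galaxy's actual schema:

```python
# Illustrative sketch of queue-based sample tracking. State names and
# the recorded fields are assumptions for illustration, not the actual
# Galaxy database schema.
SAMPLE_STATES = [
    "Submitted",
    "Library construction",
    "Pre-sequencing quantitation",
    "Sequencing",
    "Complete",
]

def advance(sample, step_info):
    """Move a sample to the next queue, recording step-specific details."""
    i = SAMPLE_STATES.index(sample["state"])
    if i == len(SAMPLE_STATES) - 1:
        raise ValueError("Sample is already complete")
    sample["state"] = SAMPLE_STATES[i + 1]
    sample.setdefault("history", []).append(step_info)
    return sample

sample = {"name": "S1", "state": "Pre-sequencing quantitation"}
advance(sample, {"quantitation_ng_ul": 25.0})
print(sample["state"])  # Sequencing
```

In the real system these states live in database tables, but the idea is the same: each transition is a data entry point for the technician.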
Assigning samples to flow cells occurs using a drag and drop jQueryUI interface. The design is flexible to allow for placing samples across multiple lanes or multiplexing multiple barcoded samples into a single lane.
Viewing sequencing results
Running a sequencing machine requires careful monitoring of results and our interface provides several ways to view this data. Raw cluster and read counts are linked to a list of runs. For higher level analyses, interactive plots are available for viewing reads over time and pass rates compared to read density. These allow adjustment of experimental procedures to maximize useful reads based on current machine chemistry.
Analysis pipeline
Utilizing a front end that organizes requests allows sequencing results to be processed through a fully automated analysis pipeline on the back end. The pipeline detects runs coming off of a sequencer, transfers files to storage and analysis machines and manages a number of processing steps:
- Alignment with bowtie or bwa.
- Generation of alignment and read statistics with Picard, the fastx toolkit and SolexaQA.
- Preparation of a summary PDF with detailed statistics about the run and alignment.
In addition to the default analysis, a full SNP calling pipeline is included with:
- GATK recalibration and realignment
- SNP identification with GATK’s unified genotyper
- Variant effect prediction with snpEff.
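Conceptually, the default and SNP calling steps above amount to composing a series of external commands for each sample. The sketch below is illustrative only; the flags and jar names are assumptions, not the pipeline's actual invocations:

```python
# Illustrative sketch of composing the per-sample pipeline steps described
# above. Flags and jar names are assumptions for illustration, not the
# actual invocations used by the pipeline.
def build_commands(fastq, genome, sample):
    bam = "%s.bam" % sample
    return [
        # alignment (bwa shown; bowtie is the alternative)
        ["bwa", "aln", genome, fastq],
        # alignment and read statistics with Picard
        ["java", "-jar", "CollectAlignmentSummaryMetrics.jar",
         "INPUT=%s" % bam, "OUTPUT=%s_metrics.txt" % sample],
        # SNP calling with GATK's unified genotyper
        ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "UnifiedGenotyper",
         "-R", genome, "-I", bam, "-o", "%s.vcf" % sample],
    ]

for cmd in build_commands("s1_fastq.txt", "hg19.fa", "s1"):
    print(" ".join(cmd))
```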
Fastq reads, alignment files, summary PDFs and other associated files are uploaded back into Galaxy Data Libraries organized by sample names. Users can download results for offline work, or import them directly into their Galaxy history for further analysis or display.
Installing and extending
The code base is maintained as a Bitbucket repository that tracks the main galaxy-central distribution. It is updated from the main site regularly to maintain compatibility, with the future goal of integrating a generalized version into the main source tree. Detailed installation instructions are available for setting up the front-end client.
The analysis pipeline is written in Python and drives a number of open source programs; it is available as a GitHub repository with documentation and installation instructions.
We are using the current system in production and continue to develop and add features based on user feedback. We would like to generalize this for other research cores with additional instruments and services, and would be happy to hear from developers working on this type of system for their facilities.
Implementation details
This work would not have been possible without the great open source toolkits and frameworks that it builds on. Galaxy provides not only an analysis framework, but also a ready to use database structure for managing samples and requests. The front end builds off existing Galaxy sample tracking work, and requires only two new database storage tables.
The main change from the existing sample tracking framework is a generalization of the sample and request relationships. Requests can both contain samples and be part of samples, so that a sequenced sample is organized as:
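A minimal sketch of that relationship (illustrative only, not Galaxy's actual database tables): a request holds samples, and each sample can in turn be part of follow-on requests such as library construction and sequencing:

```python
# Illustrative sketch of the generalized request/sample relationship.
# Class and attribute names are assumptions for illustration, not the
# actual Galaxy tables.
class Request:
    def __init__(self, name):
        self.name = name
        self.samples = []        # samples contained in this request

class Sample:
    def __init__(self, name):
        self.name = name
        self.requests = []       # follow-on requests this sample is part of

project = Request("Sequencing project")
sample = Sample("S1")
project.samples.append(sample)
sample.requests.append(Request("Library construction"))
sample.requests.append(Request("Sequencing"))
print([r.name for r in sample.requests])  # ['Library construction', 'Sequencing']
```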
By reusing and expanding the great work of the Galaxy team, we hope to eventually integrate useful parts of this work into the Galaxy codebase.
Hi Brad,
this is amazing, great work.
I’ll try to install the software locally for testing.
Raoul Bonnal
January 12, 2011 at 4:36 am
This is amazing work, Brad. Letting my collaborators know straight away…
Jan Aerts
January 12, 2011 at 5:22 am
Great work and perfect timing for me (in the process of finding the right NGS LIMS). I already installed this fork last week and had some issues with the interface, but have not been able to reproduce them (yet). But the videos show the great potential.
Btw, in the instructions you don’t mention (or I didn’t see it) that permissions should be set for the different request types to each user/role/group. So at first the ‘Lab’ tab was not visible and it seemed not to work ;)
Jelle Scholtalbers
January 12, 2011 at 6:14 am
Raoul, Jan and Jelle;
Thanks for the kind words.
Jelle;
Apologies about the lab visibility issue; this was a reversion when re-syncing to galaxy-main over the holiday break. The fix for that went in yesterday, so an hg pull && hg update should prevent having to set permissions.
If you manage to reproduce your interface problems, please do let me know. There were also a few display fixes put in last week, so it’s possible an update will get you back on track. Thanks again.
Brad Chapman
January 12, 2011 at 6:31 am
Great.
And where would you like to get bug reports, as I'm not able to see the issues tab on your Bitbucket repo?
Jelle Scholtalbers
January 12, 2011 at 7:13 am
Thanks Jelle for the heads up; I turned issues on there. Also feel free to contact me via GitHub/Bitbucket and we can discuss issues by e-mail. Either works great.
Brad Chapman
January 12, 2011 at 7:43 am
This is really excellent work! We have a 454 and it would be great if there was a version of this for that :) I’m developing for Galaxy and would be happy to help with 454 specifics etc…
Rachel Glover
January 12, 2011 at 8:30 am
just re-read that it makes it sound like I work for galaxy! I’m just another computational biologist trying to get my molecular buddies working with Galaxy and setting up our own Galaxy stuff :-D
Rachel Glover
January 12, 2011 at 8:31 am
Rachel;
It would be great to generalize this for 454 machines and workflows. If you can get the current version running and let me know the ways that things differ with your setup, we can work on generalizing it. I don’t have a good sense of 454 specifics, but am happy to learn. Feel free to send a direct message via bitbucket or GitHub if you’d like to discuss more. Thanks.
Brad Chapman
January 12, 2011 at 1:31 pm
Good organisation in the write-up. It should have included read length and base-call error rates far in excess of traditional Sanger sequencing as challenges to next generation sequencing. Adding how to tackle those challenges would make the work better.
Onyeacho Blessing
February 9, 2011 at 7:33 pm
I just installed the GalaxyLims locally. We are very interested in it.
But I found that under the administration menu, the “Manage requests” function is missing. Any suggestions?
Is there any way we can manually upload the fastq files to users accounts? Because we may use our own software to generate fastq files from the illumina raw data.
Thank you!
Xintao
February 10, 2011 at 4:18 pm
Xintao;
Thanks for taking a look at this. The ‘Manage requests’ functionality in the Admin menu is part of Galaxy’s native sample tracking software and is separate from this interface; I believe this has moved to ‘Sequencing requests’ in the menu if you’d like to investigate the native sample tracking further.
You can certainly use your own back-end analysis code with the front-end. Both are usable on their own. I’d also like to eventually generalize this to support taking fastq files from data libraries or user histories.
Thanks,
Brad
Brad Chapman
February 14, 2011 at 8:33 am
Hi,
We just had a HiSeq installed and I am very interested in using your pipeline for analysis and the Galaxy integration for visualizing data. May I know if there is a script that automates the installation of the required libraries for a non-BioLinux server (CentOS 5.5 in our case)?
Paul
February 27, 2011 at 4:42 pm
Paul;
Congrats on the new HiSeq and thanks for the interest. The CloudBioLinux scripts do install packages based on Debian/Ubuntu names; I would like to eventually have an equivalent set of packages for RedHat derived systems.
What we do have now is custom installs for quite a bit of the software that should work on any Linux system:
https://github.com/chapmanb/bcbb/tree/master/ec2/biolinux
The Custom Packages section has more details, but you just need to do:
fab -f fabfile.py install_custom:your_package_name -H localhost
This configuration file has a list of the different software:
https://github.com/chapmanb/bcbb/blob/master/ec2/biolinux/config/custom.yaml
So it’s not a “one script” approach, but all the nuts and bolts are there. Let me know how this works for you and if you have any questions.
Brad Chapman
February 27, 2011 at 8:59 pm
Thanks. We are trying to set it up. By the way, is the system tailored for the GAII? The HiSeq seems to handle the barcode information differently. Thanks.
Paul
March 23, 2011 at 9:57 pm
Paul;
It was originally designed on a GAII but we transitioned to a HiSeq recently. What problems are you running into with the barcoding?
From your previous question, I’ve been working on updating the scripts for installing the required software on CentOS. The CloudBioLinux install script:
https://github.com/chapmanb/bcbb/tree/master/ec2/biolinux
now can handle CentOS. You’ll need to adjust the server configuration to centos:
https://github.com/chapmanb/bcbb/blob/master/ec2/biolinux/config/fabricrc.txt
and then follow the instructions under ‘Targets for local builds.’
Let me know if you run into any problems.
Brad Chapman
March 24, 2011 at 6:53 am
Thanks Brad. That’s very useful. Let me try it first and write to you if I have any problems there.
As for the barcoding, I am not sure if it's a settings problem or something else, since I am quite new to the system. Our index read appears in an independent file, s_*_2_*_fastq.txt, while read 1 is in s_*_1_*_fastq.txt and read 2 in s_*_3_*_fastq.txt for a paired-end experiment. I think you have a better naming scheme for the fastq files there and I can amend accordingly. I am not very sure about the bc_file.
Paul
March 24, 2011 at 2:25 pm
Paul;
Glad the build scripts will be helpful. For your barcoding issue, it sounds like this might be a problem with your CASAVA configuration since those s_*_sequence fastq files are produced through that pipeline. Our approach avoids this entirely by producing fastq files from the generated qseq files with this script:
https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/solexa_qseq_to_fastq.py
The 1,2,3 style Illumina barcodes are handled correctly. This script is part of the bigger pipeline which detects new flowcells being dumped, processes bcl to qseq files, generates fastq, and then transfers them to processing:
https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/illumina_finished_msg.py
More general details are in the README file:
https://github.com/chapmanb/bcbb/tree/master/nextgen
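As a rough sketch of what the qseq to fastq conversion involves (simplified for illustration; the actual script handles more cases): each qseq line is tab-delimited with the sequence, quality string and pass-filter flag in the last three fields, and the Illumina +64 quality offset needs shifting to the standard +33:

```python
# Simplified sketch of qseq -> fastq conversion, not the actual
# solexa_qseq_to_fastq.py script. Assumes standard 11-field qseq lines:
# sequence in field 9, phred+64 quality in field 10, pass-filter in 11.
def qseq_to_fastq(line, read_id):
    parts = line.rstrip("\n").split("\t")
    seq = parts[8].replace(".", "N")    # qseq marks no-calls with '.'
    qual = "".join(chr(ord(c) - 64 + 33) for c in parts[9])  # +64 -> +33
    passed = parts[10] == "1"           # 1 means the read passed filtering
    return "@%s\n%s\n+\n%s\n" % (read_id, seq, qual), passed

record, ok = qseq_to_fastq(
    "HWI\t1\t1\t1\t100\t200\t0\t1\tACGT.\thhhhB\t1", "read1")
print(record, ok)
```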
Hope this helps
Brad Chapman
March 25, 2011 at 8:43 am
Thanks again Brad. We had a slightly different structure and I overlooked your script. Thanks for your help!
Paul
March 25, 2011 at 8:59 am
By the way, the illumina_finished_msg.py script requires a galaxy_config file; where should I look to disable the connection with Galaxy for the moment? We'd love to include the interaction with Galaxy, but I want to implement the analysis pipeline first. Thanks again.
Paul
March 29, 2011 at 3:15 pm
Paul;
What we use from the Galaxy configuration files are the details about the RabbitMQ server for communication.
There is a minimal example configuration in the config directory:
https://github.com/chapmanb/bcbb/blob/master/nextgen/config/universe_wsgi.ini
where you will need to adjust the server, username and password for RabbitMQ. Otherwise the post-processing and transfer is fully separate from Galaxy. Hope this helps
Brad Chapman
March 30, 2011 at 5:42 am
How can I try out only the post-processing and transfer then? I was trying to run illumina_finished_msg.py, but it seems to be looking for more than universe_wsgi.ini and transfer_info.yaml. I guess I didn't set up the msg_db correctly.
Paul
April 5, 2011 at 3:25 am
Paul;
Those two configuration files are what you need. Can you let me know errors you are seeing? If it’s easier to debug via GitHub than blog comments, feel free to send me a message or post an issue:
https://github.com/chapmanb/bcbb
Brad Chapman
April 5, 2011 at 6:28 am
Thanks Brad. I don't think there is a bug. I think I didn't set up the msg_db properly. Let me follow up on the issue at GitHub.
Thanks for your help. Your tool is very useful to us.
Paul
April 5, 2011 at 9:21 am
Hi Brad,
Thanks for the great work presented here. Please forgive this naive question, but are nglims and the analysis pipeline described above included and working out of the box in CloudBioLinux?
Regards,
Aaron.
benchtoclinic
February 4, 2012 at 4:18 am
Aaron;
The analysis pipeline works with CloudMan plus CloudBioLinux. This blog post describes how to start it up and run the pipeline from a custom interface similar to the LIMS system:
https://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/
The LIMS isn't part of this since I figured it would be more useful on a local installation associated with a sequencer than on Amazon, but getting it in place there is a matter of swapping out the Galaxy instance used. If you decide you want that, this is the Galaxy codebase to use:
https://bitbucket.org/chapmanb/galaxy-central
Hope this helps.
Brad Chapman
February 4, 2012 at 6:14 am
Cheers Brad, much appreciated. I was thinking of using CBL and CloudMan for demonstration purposes to try and get the go-ahead for a local install, but I guess I can also do that with a local VM anyway; just a little more work to get the analysis pipeline going (index, loc file set-up, etc.)
Aaron.
benchtoclinic
February 7, 2012 at 5:45 am
Hi Brad,
Thank you so much for this amazing work. I would like your guidance and help in pointing me to the correct steps.
I would like to use your jQuery sample LIMS along with Galaxy's NGS analysis pipeline (the same exact workflow diagram you have shown) on our cluster (not on Amazon). In our lab we currently have a HiSeq 1000, and we will be getting a HiSeq 2500 and MiSeq soon as well. Currently we use different proprietary software, but we would also like to use the above workflow.
Is there any repository that has these packages synced along with your sample LIMS and NGS pipeline, using the packages below?
Picard
bowtie or bwa
FastQC
GATK (version 1.2; only for variant calling)
snpEff (version 2.0.2; only for variant calling)
thank you so much,
SJ
SJ
February 18, 2012 at 3:16 pm
SJ,
Thank you for the positive feedback. We use this with HiSeqs on local clusters, so it's definitely useful outside of Amazon infrastructure.
The CloudBioLinux framework contains build targets to help install associated software and works fine on local machines. That’s what we use to install and maintain the software not found in system package repositories:
https://github.com/chapmanb/cloudbiolinux
The README has more details in the Targets for local builds/Custom package installs section. I hope this is what you are looking for.
Brad Chapman
February 18, 2012 at 9:35 pm
Hi Brad,
Thank you so much. Sure, I will look into CloudBioLinux. I am guessing it comes with your Galaxy-integrated jQuery-based LIMS, correct?
Thanks,
SJ
SJ
February 21, 2012 at 1:06 pm
SJ;
CloudBioLinux is designed as a general platform, so it wouldn't contain very specific code like this. However, it integrates nicely with CloudMan from the Galaxy team, which allows you to run any Galaxy version, including this LIMS fork, directly. This recent post has more details on doing this:
https://bcbio.wordpress.com/2011/11/29/making-next-generation-sequencing-analysis-pipelines-easier-with-biocloudcentral-and-galaxy-integration/
This is the easiest path, but is Amazon specific. For a local install, you’d use CloudBioLinux to install dependencies you want (or install manually), then clone and start up the LIMS Galaxy instance.
Hope this helps.
Brad Chapman
February 21, 2012 at 8:27 pm
Hi Brad,
Sure, I will check it out.
Thank you so much for your reply.
SJ
SJ
February 23, 2012 at 8:35 am