Blue Collar Bioinformatics

Note: new posts have moved to http://bcb.io/. Please look there for the latest updates and comments.

Archive for the ‘OpenBio’ Category

Trimming adaptors from short read sequences

One common post-processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small, as with short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those that differentiate DNA sequences.
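
For concreteness, here is a minimal sketch of the kind of string-similarity matching I mean, using difflib from the standard library; the function and scoring are illustrative only, not the code I ended up using:

import difflib

def fuzzy_adaptor_position(adaptor, read):
    """Illustrative sketch: slide a window across the read and score each
    window against the adaptor with difflib's similarity ratio.
    """
    best_score, best_pos = 0.0, -1
    for pos in range(len(read) - len(adaptor) + 1):
        window = read[pos:pos + len(adaptor)]
        score = difflib.SequenceMatcher(None, adaptor, window).ratio()
        if score > best_score:
            best_score, best_pos = score, pos
    return best_pos, best_score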

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition needed to trim a sequence with no more than a specified number of errors. The downside to this more rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle the common case where the exact adaptor sequence is present, we first do a simple string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
            enumerate(seq_region))
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
    else:
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),
                right_side)

If the aligned region contains no more than the specified number of errors, we have found an adaptor and remove it, returning the trimmed sequence. The right_side argument specifies whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    """
    if right_side:
        try:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
    else:
        try:
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]
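
As a quick sanity check, here is how the function behaves on a pair of toy sequences (these examples are mine, not from the original post):

adaptor = "TCGTATGCCG"
# adaptor at the 3' end: everything from the adaptor onward is removed
print(trim_adaptor("AACCGGTTTCGTATGCCG", adaptor, num_errors=1))    # AACCGGTT
# adaptor at the 5' end: everything up to and including it is removed
print(trim_adaptor("TCGTATGCCGAACCGGTT", adaptor, num_errors=1,
                   right_side=False))                               # AACCGGTT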

Here is an example script that uses this function along with Biopython to trim reads in a FASTQ format file. Each read is analyzed to determine if it contains the adaptor, allowing up to 2 errors. If the adaptor is identified and trimmed, the remaining useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
   
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

if __name__ == "__main__":
    main(sys.argv[1])

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Sorting genomic alignments using Python

Recently, I was asked to sort a MAF (Multiple Alignment Format) file containing genomic alignments. This was a large vertebrate alignment file, and the researchers were interested in manually examining the longest matches. The tricky part is that the entire file cannot comfortably fit in memory on a standard machine.

The common solution for dealing with large, memory-intensive files is to pre-parse the file and provide a way to look up individual records based on an identifier. For instance, items could be pushed into a relational database table and retrieved based on primary keys. While moving to disk access is less efficient, it allows your code to scale to these types of real-life problems without resorting to buying large memory machines.
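
As a minimal sketch of that generic pattern (not the approach used below, and with a made-up table and example values), the standard library's sqlite3 module could serve as the disk-backed store:

import sqlite3

conn = sqlite3.connect("alignments.db")
conn.execute("CREATE TABLE IF NOT EXISTS alignments "
             "(id INTEGER PRIMARY KEY, size INTEGER, record TEXT)")
# push each record to disk as it is parsed
conn.execute("INSERT INTO alignments (size, record) VALUES (?, ?)",
             (1245, "alignment text would go here"))
conn.commit()
# later, pull individual records back by primary key instead of
# keeping the whole file in memory
cursor = conn.execute("SELECT record FROM alignments WHERE id = ?", (1,))
record_text = cursor.fetchone()[0]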

We will use the bx-python library to read the MAF alignment files, prepare an index for accessing the alignments individually, and ultimately perform the sorting.

The first step is to define the index retrieval method. bx-python has index retrieval functionality that allows querying and retrieving records based on genomic location. We don’t need that advanced query capability here, and can instead retrieve records based on their position in the file. We build the index by looping through each record and recording the file position and genomic locations in the index.

from bx.align import maf
from bx import interval_index_file

def build_index(in_file, index_file):
    indexes = interval_index_file.Indexes()
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec = reader.next()
            if rec is None:
                break
            for c in rec.components:
                indexes.add(c.src, c.forward_strand_start,
                        c.forward_strand_end, pos, max=c.src_size)

    with open(index_file, "w") as index_handle:
        indexes.write(index_handle)

Now we have a unique index to retrieve a record — the position in the file. The second step is to loop through the file again and build a list of alignment sizes and positions. Alignment size is the first value in each list item, since this is the criterion to sort by. This list easily fits in memory, and we sort it with the largest alignments first.

rec_info = []
with open(in_file) as in_handle:
    reader = maf.Reader(in_handle)
    while 1:
        pos = reader.file.tell()
        rec = reader.next()
        if rec is None:
            break
        rec_info.append((rec.text_size, pos))
rec_info.sort(reverse=True)

Finally, we loop through the sorted list of sizes and use the associated positions to retrieve records from our index. Each record is then written to an output file.

index = maf.Indexed(in_file, index_file)
with open(out_file, "w") as out_handle:
    writer = maf.Writer(out_handle)
    for size, pos in rec_info:
        rec = index.get_at_offset(pos)
        writer.write(rec)

The full script puts this together into a usable program. This could be adjusted to deal with other alignment formats supported by bx-python, like axt files.
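
For orientation, here is a rough sketch of how the pieces above might be glued together; the file handling and command line details are my assumptions, so see the full script for the real version:

from __future__ import with_statement
import os
import sys

from bx.align import maf

def sort_maf_by_size(in_file, out_file, index_file):
    # build the positional index if needed (build_index is defined above)
    if not os.path.exists(index_file):
        build_index(in_file, index_file)
    # collect (alignment size, file position) pairs and sort, largest first
    rec_info = []
    with open(in_file) as in_handle:
        reader = maf.Reader(in_handle)
        while 1:
            pos = reader.file.tell()
            rec = reader.next()
            if rec is None:
                break
            rec_info.append((rec.text_size, pos))
    rec_info.sort(reverse=True)
    # write the records back out in sorted order using the index
    index = maf.Indexed(in_file, index_file)
    with open(out_file, "w") as out_handle:
        writer = maf.Writer(out_handle)
        for size, pos in rec_info:
            writer.write(index.get_at_offset(pos))

if __name__ == "__main__":
    sort_maf_by_size(sys.argv[1], sys.argv[2], sys.argv[1] + ".index")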

Written by Brad Chapman

July 26, 2009 at 3:52 pm

Posted in OpenBio

Thoughts from BOSC 2009; Python in bioinformatics

I’d like to share my thoughts on two major themes that emerged from my trip to Sweden for the 2009 Bioinformatics Open Source Conference (BOSC). I spoke briefly about publishing biological data on the web; you can check out the slides on Slideshare. This led to discussions with several folks from the open source and Python bioinformatics communities. The major themes of these conversations were organization within the Python bioinformatics community and growth of a platform for developing web enabled applications.

Python in Bioinformatics

One of the unique elements of the Python bioinformatics community is that the work is distributed amongst several different packages. Unlike Perl, where many programmers regularly consolidate their code into the BioPerl project, the Python community has settled on a few different packages: Biopython, bx-python, pygr and PyCogent are a few popular ones. Instead of working with a monolithic code base, users can pick functionality amongst several choices.

This distinctive organization is not a bad thing. It avoids creating an unwieldy package that can be hard to maintain and refactor. It allows programmers to explore solutions in ways better suited to their particular problems. Finally, it provides individual recognition for the hard work researchers put into building and maintaining reusable code.

What the distribution does mean is that the Python bioinformatics community needs to work harder at communication and coordination. One great idea was to write a grant to help bring Python biology programmers together for a conference and hacking session, ideally alongside a conference like BOSC or SciPy. This would provide an impetus for contributors to learn and discuss each other’s platforms. Beyond goodwill and community, the deliverables would be documentation and code contributing to integration between projects. This would enable scientists by lowering the learning curve for producing useful biology-related code in Python.

Galaxy

My talk focused on developing small, reusable presentation and backend components for building web enabled applications. One really insightful question asked whether the community should focus on building a platform these components could be plugged into, or whether the components themselves would eventually evolve towards a larger structure. The first approach was taken very successfully by the Firefox browser with its plugin architecture, which allows end users to build amazing web interfacing applications within the browser. The second approach is the one I am more used to taking: build relatively small things that work, with an eye towards integration.

A discussion with James Taylor convinced me that it was worthwhile to take a longer look at the Galaxy project. Galaxy is an excellent web based front end to many bioinformatics programs and scripts, allowing biologists to put together analysis pipelines. They have a powerful public site which means using Galaxy requires no installation for many use cases.

The code is also publicly available and you can run your own local Galaxy instances, plugging in custom programs through their XML based tool interface. The architecture of the system is remarkably similar to the one I converged on for my work. It features a Pylons-like backend, SQLAlchemy for database interaction, and jQuery for the javascript interface. Deployment to cloud infrastructure can be done on Amazon EC2.

We have installed Galaxy locally and are taking it for a spin on our data presentation tasks. The tool plugin interface works as described, and we have had good luck integrating it with custom input types. I will be trying more complex integrations with custom displays and more Python code on the backend, and hopefully will have future posts covering that. Generally, I hope Galaxy can serve as a platform in which custom presentation code can be built, distributed, and reused.

I’d be happy to hear your thoughts about either the biology Python community or Galaxy as a platform for web presentation work.

Written by Brad Chapman

July 19, 2009 at 8:13 pm

Posted in OpenBio

Talking at BOSC 2009 about publishing biological data on the web

The Bioinformatics Open Source Conference (BOSC) is taking place later this month in Stockholm, Sweden. I will be attending for the first time in a few years, and giving a short and sweet 10 minute talk about ideas for publishing biological data on the web. BOSC provides a chance to meet and talk with many of the great people involved in open source bioinformatics; the schedule this year looks fantastic. The talk will be held in conjunction with The Data and Analysis Management special interest group, also full of interesting talks.

The talk will promote development of reusable web based interface libraries layered on top of existing open source projects. The PDF abstract provides the full description and motivation; below is a more detailed outline based on some brainstorming and organization:

  • Motivation: rapidly organize and display biological data in a web accessible format.
  • Current state: reusable bioinformatics libraries targeted at programmers — Biopython, bx-python, pygr, PyCogent
  • Current state: back end databases for storing biological data — BioSQL, GMOD
  • Current state: full featured web applications targeted at users — Galaxy, GBrowse
  • My situation: biologist and developer with organized data that needs analysis and presentation, internally with collaborators and externally with the larger community.
  • Proposal: integrate bioinformatics libraries, database schemas, and open source web development frameworks to provide re-usable components that can serve as a base for custom data presentation.
  • Framework: utilize cloud infrastructure for reliable deployment — Google App Engine, Amazon EC2
  • Framework: make use of front end javascript frameworks — jQuery, ExtJS.
  • Framework: make use of back end web frameworks — Pylons
  • Implementation: Demo server for displaying sequences plus annotations
  • Implementation: Utilizes the BioSQL schema, ported to an object-oriented data store; Google App Engine backend or MongoDB backend
  • Implementation: Data import/export with Biopython libraries — GenBank in and GFF out
  • Implementation: Additional screenshots from internal web displays.
  • Challenges: Generalizing and organizing display and retrieval code without having to buy into a large framework.
  • Challenges: Re-usable components for cross-language functionality; javascript front end displays for multi-language back ends.
  • Challenges: Build a community that thinks of reusing and sharing display code as much as parsing and pipeline development code.

I would be happy to hear comments or suggestions about the talk. If you’re going to BOSC and want to meet up, definitely drop me a line.

Written by Brad Chapman

June 11, 2009 at 7:41 am

Python libraries for synthetic biology

Researchers today are empowered with the ability to devise DNA sequences and order these designs for fabrication. Starting with known pieces from biological systems, variations can be conceptualized and explored with increasing ease. As a result, the field of synthetic biology has expanded along with techniques used to provide larger pieces of synthesized DNA. Companies such as DNA 2.0 and GENEART provide on-demand synthesis services. Additionally, many researchers design, optimize and generate custom DNA sequences in their own lab.

My previous job was at a synthetic biology start-up, working on the software side of their fabrication and design services. The company was unfortunately a recent victim of the economy. The lessons learned from five years of in-the-trenches synthesis work will be very useful to others tackling similar projects. With that in mind, the Python libraries used for construction, sequencing verification, and many other synthetic biology related tasks are being made available on Bitbucket.

Some highlights of the library:

The code in this library was integrated as part of an automated synthesis platform. As a result, it does not feature ready-to-run scripts. Rather, the code can serve as a source for developing these types of tools. I would love to see the most useful parts refactored and integrated into existing software communities like Biopython. Please drop a note if you have any interest in collaborating on such a project.

Written by Brad Chapman

May 31, 2009 at 9:28 pm

Biopython projects for Google Summer of Code

Google Summer of Code is a brilliant program that provides a summer stipend for students to work on open source projects. This year, I have been evaluating Biopython student proposals under the watchful eye of The National Evolutionary Synthesis Center (NESCent). Biopython was lucky enough to have interest from several students, resulting in some excellent proposals. Two of these were selected for funding:

  • Eric Talevich’s proposal on adding support to Biopython for PhyloXML, an XML format for representing phylogenetic trees and associated data. I will be the primary mentor for this project, with Christian Zmasek, the author of PhyloXML, providing plenty of capable secondary mentoring.
  • Nick Matzke’s proposal on Biogeographical Phylogenetics. This is an integrated project that involves extracting biodiversity data from the web, merging it with phylogenetic data, and providing analysis and display libraries. Nick helped assemble a full team of mentors, and I will be providing input on integrating his code with Biopython while learning about Biogeography along the way.

This is my first year working with Summer of Code, and I could not be more impressed by the professionalism and hard work of the organizers at both Google and NESCent. The program is designed from the ground up to be simultaneously inclusive and rigorously selective. Students are asked to prepare detailed project plans to demonstrate that they understand the subject, are able programmers, and can communicate effectively. There is also a strong eye given to students who are likely to continue on in their open source communities after the summer; after all, open source work is really built on the free labor of many dedicated people.

Having worked on open source projects for several years, it is invigorating and heartening to see a program designed to encourage the next generation of open source leaders. My time on the program thus far has been a great learning experience, and I am looking forward to a productive summer.

Written by Brad Chapman

April 20, 2009 at 5:38 pm

Posted in OpenBio

Examining and adjusting your GFF file

Generic Feature Format (GFF) defines a standard template for representing biological features. Within this template, however, is room for flexibility. Different GFF producers may decide to format their data in slightly different ways. While parsing these files is not a problem, correctly interpreting and utilizing the data can be. The in-development Biopython GFF parser provides utilities to get a high level summary of the elements of a GFF file, and to adjust line items in the file during parsing.

Examining a GFF file

When downloading a new GFF file, the first step is getting an overview of the file contents. The GFF parser provides a GFFExaminer class to help with this. The first function of interest is available_limits:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)
pprint.pprint(possible_limits)

It returns a dictionary defining various ways to limit your parsing of the GFF file, along with the count of each item. As an example, here is a trimmed dump from one of the test files. This file has features on two different record IDs — chromosomes ‘I’ and ‘X’, with 159 and 6 items, respectively. Also listed are the combinations of the second (source) and third (type) columns, and the type column by itself:

{'gff_id': {('I',): 159,
            ('X',): 6},
 'gff_source_type': {('Coding_transcript', 'CDS'): 27,
                     ('Coding_transcript', 'exon'): 33,
                     ('Coding_transcript', 'five_prime_UTR'): 4,
                     ('Coding_transcript', 'gene'): 2,
                     ('Coding_transcript', 'intron'): 29,
                     ('Coding_transcript', 'mRNA'): 4,
                     ('Coding_transcript', 'three_prime_UTR'): 3,
                     ('mass_spec_genome', 'translated_nucleotide_match'): 7},
 'gff_type': {('CDS',): 57,
              ('exon',): 33,
              ('five_prime_UTR',): 4,
              ('gene',): 2,
              ('intron',): 29,
              ('mRNA',): 4,
              ('three_prime_UTR',): 3,
              ('translated_nucleotide_match',): 7}}

In addition to the overview of file contents, nested relationships are another important component of the file to understand. A summary of these is available through the parent_child_map function:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)
pprint.pprint(pc_map)

Again, a dictionary is returned. The keys in the dictionary are parent source and type elements in the file, while the values are children of those elements. For instance, here is the dictionary for a three tiered relationship where genes have mRNAs, and each mRNA can have coding regions, exons, introns, and 5′ and 3′ untranslated regions:

{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}

Adjusting GFF lines during parsing

Occasionally you may run into a file format that you can comprehend, but that does not match your expectations for where items should be. SOLiD GFF files are one example; many thanks are due to David Schruth, who has been patiently working with me on the parsing of these files. They have the read name in the first column, which is normally used for the record ID the feature maps to. The actual record ID is contained in an attribute, i=1, where the 1 corresponds to the index of the record it maps to in the original FASTA alignment file:

3_336_815_F3    solid   read    55409   55428   10.4    +       .       i=1

The GFFAddingIterator has an optional argument, line_adjust_fn, which can be used to solve this problem. The function is called each time a line is read, and is passed a parsed dictionary representing the line. The dictionary for the above line is:

{'id': '',
 'is_gff2': False,
 'location': [55408, 55428],
 'quals': {
           'i': ['1'],
           'score': ['10.4'],
           'source': ['solid']},
 'rec_id': '3_336_815_F3',
 'strand': 1,
 'type': 'read'}

The function takes this item as an argument and returns the dictionary, but with any adjustments that need to be made. In our example, we look up the name of the record corresponding to the i=1 index. This name is swapped in for the rec_id, while that information moves to an attribute named read_name:

from Bio import SeqIO
from BCBio.GFF.GFFParser import GFFAddingIterator

class SolidFastaRemap:
    def __init__(self, initial_fasta):
        self._in_map = self._get_index_map(initial_fasta)

    def _get_index_map(self, in_file):
        in_map = dict()
        in_handle = open(in_file)
        for index, rec in enumerate(SeqIO.parse(in_handle, "fasta")):
            in_map[index] = rec.id
        in_handle.close()
        return in_map

    def adjust_fn(self, results):
        # 1-based indexes; convert to 0-based
        rec_index = int(results['quals']['i'][0]) - 1
        read_name = results['rec_id']
        results['quals']['read_name'] = [read_name]
        results['rec_id'] = self._in_map[rec_index]
        return results

remapper = SolidFastaRemap(fasta_file)
gff_iterator = GFFAddingIterator(line_adjust_fn=remapper.adjust_fn)
rec_dict = gff_iterator.get_all_features(gff_file)

This allows you to fix the file during the parsing, saving multiple passes through the file. This general functionality can be used to cleanly deal with any other inconsistencies that crop up during your GFF parsing adventures.
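
As another, purely hypothetical example (the normalization rule here is mine, not from the post), the same hook could standardize record identifiers on the fly:

from BCBio.GFF.GFFParser import GFFAddingIterator

def add_chr_prefix(results):
    # hypothetical adjuster: give every record ID a UCSC-style "chr" prefix
    if not results['rec_id'].startswith("chr"):
        results['rec_id'] = "chr%s" % results['rec_id']
    return results

gff_iterator = GFFAddingIterator(line_adjust_fn=add_chr_prefix)
rec_dict = gff_iterator.get_all_features(gff_file)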

Written by Brad Chapman

April 12, 2009 at 8:55 pm

Posted in OpenBio
