Trimming adaptors from short read sequences

One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
            enumerate(seq_region))
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
    else:
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),
                right_side)

If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    """
    if right_side:
        try:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
    else:
        try:
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
   
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio

Tagged with bioinformatics, biopython, ngs, python

11 Responses

Subscribe to comments with RSS.

Never occurred to me to do that in Python, neat ;-) Got to admit I prefer ShortRead/Biostrings for this sort of processing, but who knows when this might turn out useful.

Oliver Hofmann

August 9, 2009 at 9:29 pm

Reply
- Hi Brad,
  
  Did you try slicing the SeqRecord instead of the Seq object? Like in the “Trimming off primer sequences” FASTQ example in the Tutorial:
  http://biopython.org/DIST/docs/tutorial/Tutorial.html
  
  Something like:
  
  # def trim_adaptor(record, …) :
  # #code skipped
  # if diffs > num_errors:
  # return record #no trimming
  # if right_side:
  # pos = record.seq.find(region)
  # return record[:pos]
  # else:
  # pos = record.seq.rfind(region)
  # return record[pos+len(region):]
  
  Doing this will also slice the quality information, meaning you can then easily output trimmed FASTQ by changing the SeqIO.write line from “fasta” to “fastq”.
  
  Plus this would make it easier to restructure your main loop into a generator expression, and then make a single call to SeqIO.write, which I would expect to be faster than making multiple calls:
  
  # recs = (trim_adaptor(rec,adaptor_seq,num_errors) \
  # for rec in SeqIO.parse(in_handle, “fastq”)
  # SeqIO.write(recs, out_handle, “fastq”)
  
  It would also be fairly simple to remove records which don’t have the adaptor sequence if you wanted to do that.
  
  Peter
  
  Peter
  
  August 10, 2009 at 4:49 am
  
  Reply
  - Peter;
    Awesome — thanks for the ideas. I had an ugly decorator allowing passing either string or Seq objects, but should have been using the available slicing facilities. I updated the code and post; that makes it much cleaner. Thanks,
    
    Brad
    
    Brad Chapman
    
    August 10, 2009 at 7:54 am
    
    Reply
    - Peter;
      Slicing the SeqRecords is much more elegant and generalizable, but gave poor performance on large files. If you don’t need to maintain the quality scores, removing the overhead of SeqRecord slicing is much quicker.
      
      As a speed up, we can also check for the exact sequence first and only do alignments in cases where it isn’t found. Practically, this gave me an 8x boost. I updated the post and github code to reflect this ideas.
      
      Thanks again,
      Brad
      
      Brad Chapman
      
      August 12, 2009 at 11:19 am
      
      Reply
      - I’ve replied again on the mailing list – this comment box isn’t very nice to writing a long message.
        http://lists.open-bio.org/pipermail/biopython/2009-August/005430.html
        
        Peter
        
        August 12, 2009 at 6:23 pm
- Oliver;
  Good call — I forgot to mention the R/Bioconductor options. The ShortRead package allows trimming with the srdistance function:
  
  http://www.bioconductor.org/packages/2.5/bioc/html/ShortRead.html
  
  This uses edit distance. Biostrings uses alignments and has an example section on trimming:
  
  http://www.bioconductor.org/packages/2.5/bioc/html/Biostrings.html
  
  Brad
  
  Brad Chapman
  
  August 10, 2009 at 7:01 am
  
  Reply
Follow up thread:
http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html

Peter

August 10, 2009 at 6:18 am

Reply
Hello,

If I’m not mistaken, this script is aligning the adaptor from the left to the right part of the sequence. I’ve noticed that the Galaxy tool (you are talking about in this article: https://bcbio.wordpress.com/2009/07/19/thoughts-from-bosc-2009-python-in-bioinformatics/ https://main.g2.bx.psu.edu/ : clip adapter sequences) is also trimming but aligns the adaptor from the right part of the sequence to the left.
Is there a way (option or change possible in the script) to do the same as the web-based tool ?

Regards,

Nicolas

Nicolas

April 24, 2012 at 6:21 am

Reply
- Nicolas;
  This script also trims from the right side/3′ end of the sequence, so matches the behavior of the Galaxy tool. This is the most common placement for adaptors. If you need something even more general this script for barcode trimming can handle both ends and trim from paired end reads:
  
  https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
  
  Hope this helps.
  
  Brad Chapman
  
  April 24, 2012 at 7:26 am
  
  Reply
  - Thanks you very much for your quick reply. I know that you can trim a sequence by removing an adaptor region and all sequence to the right or left but I don’t know how to start the alignment from the right side.
    If I take a concrete example to point out the different results with your script and the Galaxy tool:
    
    This is the sequence I’m working on:
    GCTGTTGGATGATGAGCAGCCCCTGCGGCTGGCGTA
    This is the adaptor sequence I’m using:
    CCCCAGCCCC
    I run the adaptor_trim.py script with a maximum number of errors of 3 (Galaxy tool doesn’t have the option of maximum number of errors, not to my knowledge at least).
    
    Here is the result I get with the adaptor_trim.py script:
    GCTGTTGGATGATGAG
    And here is the result with the Galaxy tool:
    GCTGTTGGATGATGAGCAG
    
    We can see that there are two different possible matching sequences close to the adaptor sequence (with less or 3 errors). The adaptor_trim.py, which is aligning from the right to the left will find the first matching sequence from the right and the Galaxy tool will find the first matching sequence from the left.
    
    This is how I understand it and explain the different result, but maybe I’m wrong and would appreciate a correction.
    
    Do you know how I could get the result from the Galaxy tool with your script?
    
    Regards,
    
    Nicolas
    
    Nicolas
    
    April 24, 2012 at 8:44 am
    
    Reply
    - Nicolas;
      Thanks for the example, this makes sense now. The code does a local alignment of the adaptor against the full sequence. In this case, the scoring metric prefers the shortened CAGCCCC match over the longer CCCCAGCCCC match with 3 internal errors.
      
      We’d have to dig into the Galaxy code to see how they do the adaptor trimming but I don’t have a good way to replicate short of custom adjustment of the parameters for local alignment:
      
      https://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py#L89
      
      Allowing 3 errors for such a short adaptor is going to give you lots of problematic cases like this since that’s pretty degenerate. Could you provide a larger adaptor sequence to better identify the region you’re looking for?
      
      Brad Chapman
      
      April 24, 2012 at 3:34 pm
      
      Reply

Blue Collar Bioinformatics