Trimming adaptors from short read sequences
One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.
We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.
My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.
That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.
The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:
from Bio import pairwise2
def trim_adaptor(seq, adaptor, num_errors, right_side=True):
gap_char = '-'
exact_pos = str(seq).find(adaptor)
if exact_pos >= 0:
seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
adapt_region = adaptor
else:
seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
str(adaptor), 5.0, -4.0, -9.0, -0.5,
one_alignment_only=True, gap_char=gap_char)[0]
adapt_region = adaptor_a[start:end]
seq_region = seq_a[start:end]
matches = sum((1 if s == adapt_region[i] else 0) for i, s in
enumerate(seq_region))
# too many errors -- no trimming
if (len(adaptor) - matches) > num_errors:
return seq
# remove the adaptor sequence and return the result
else:
return _remove_adaptor(seq, seq_region.replace(gap_char, ""),
right_side)
If the alignment contains fewer than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):
def _remove_adaptor(seq, region, right_side=True):
"""Remove an adaptor region and all sequence to the right or left.
"""
if right_side:
try:
pos = seq.find(region)
# handle Biopython SeqRecords
except AttributeError:
pos = seq.seq.find(region)
return seq[:pos]
else:
try:
pos = seq.rfind(region)
# handle Biopython SeqRecords
except AttributeError:
pos = seq.seq.rfind(region)
return seq[pos+len(region):]
Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.
from __future__ import with_statement
import sys
import os
from Bio import SeqIO
from adaptor_trim import trim_adaptor
def main(in_file):
adaptor_seq = "TCGTATGCCGTCTTCTGC"
num_errors = 2
base, ext = os.path.splitext(in_file)
out_file = "%s-trimmed.fa" % (base)
with open(in_file) as in_handle:
with open(out_file, "w") as out_handle:
for rec in SeqIO.parse(in_handle, "fastq"):
trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
if len(trim) > 0 and len(trim) < len(rec):
rec.letter_annotations = {}
rec.seq = trim
SeqIO.write([rec], out_handle, "fasta")
You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Never occurred to me to do that in Python, neat ;-) Got to admit I prefer ShortRead/Biostrings for this sort of processing, but who knows when this might turn out useful.
Oliver Hofmann
August 9, 2009 at 9:29 pm
Hi Brad,
Did you try slicing the SeqRecord instead of the Seq object? Like in the “Trimming off primer sequences” FASTQ example in the Tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
Something like:
# def trim_adaptor(record, …) :
# #code skipped
# if diffs > num_errors:
# return record #no trimming
# if right_side:
# pos = record.seq.find(region)
# return record[:pos]
# else:
# pos = record.seq.rfind(region)
# return record[pos+len(region):]
Doing this will also slice the quality information, meaning you can then easily output trimmed FASTQ by changing the SeqIO.write line from “fasta” to “fastq”.
Plus this would make it easier to restructure your main loop into a generator expression, and then make a single call to SeqIO.write, which I would expect to be faster than making multiple calls:
# recs = (trim_adaptor(rec,adaptor_seq,num_errors) \
# for rec in SeqIO.parse(in_handle, “fastq”)
# SeqIO.write(recs, out_handle, “fastq”)
It would also be fairly simple to remove records which don’t have the adaptor sequence if you wanted to do that.
Peter
Peter
August 10, 2009 at 4:49 am
Peter;
Awesome — thanks for the ideas. I had an ugly decorator allowing passing either string or Seq objects, but should have been using the available slicing facilities. I updated the code and post; that makes it much cleaner. Thanks,
Brad
Brad Chapman
August 10, 2009 at 7:54 am
Peter;
Slicing the SeqRecords is much more elegant and generalizable, but gave poor performance on large files. If you don’t need to maintain the quality scores, removing the overhead of SeqRecord slicing is much quicker.
As a speed up, we can also check for the exact sequence first and only do alignments in cases where it isn’t found. Practically, this gave me an 8x boost. I updated the post and github code to reflect this ideas.
Thanks again,
Brad
Brad Chapman
August 12, 2009 at 11:19 am
I’ve replied again on the mailing list – this comment box isn’t very nice to writing a long message.
http://lists.open-bio.org/pipermail/biopython/2009-August/005430.html
Peter
August 12, 2009 at 6:23 pm
Oliver;
Good call — I forgot to mention the R/Bioconductor options. The ShortRead package allows trimming with the srdistance function:
http://www.bioconductor.org/packages/2.5/bioc/html/ShortRead.html
This uses edit distance. Biostrings uses alignments and has an example section on trimming:
http://www.bioconductor.org/packages/2.5/bioc/html/Biostrings.html
Brad
Brad Chapman
August 10, 2009 at 7:01 am
Follow up thread:
http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html
Peter
August 10, 2009 at 6:18 am