More python GFF parsing — iterative parsing and GFF2 nested features
Work on the Python generic feature format (GFF) parser continues to push forward; many thanks to those who have provided feedback to help refine the functionality. Previous posts discussed the initial implementation, introduced MapReduce parsing for parallelization, and covered deploying on a cluster and GFF2 parsing. This post describes an interface for iterator-based parsing of GFF files and support for nested features in GFF2 files.
Iterative parsing of GFF
GFF files are line-based, and related features can be located anywhere in the file. To guarantee that all features are parsed and combined correctly, the entire file needs to be scanned and loaded. Large files therefore require strategies that load the data without exhausting available memory.
In some cases, it is known that parsing the file in chunks will not result in any information being lost. GFF files produced by SOLiD for short read alignments are one common case. These are read-based, non-nested, and quite large. To tackle these files, the parser now has an iterator-based interface that can be used to iterate over sections of the file:
    from BCBio.GFF.GFFParser import GFFAddingIterator

    gff_iterator = GFFAddingIterator()
    for rec_dict in gff_iterator.get_features(gff_file, target_lines=3000000):
        for rec in rec_dict.values():
            # deal with rec.features
            pass
This parses a file into ~350MB sized pieces, returning a dictionary of Biopython SeqRecord objects keyed by their names. Each SeqRecord contains all of the features added from that chunk of the file. These can be persisted to a database or otherwise analyzed before proceeding on to the next chunk, keeping memory requirements more reasonable.
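The chunk-then-persist pattern can be sketched without the parser itself. The following is a minimal illustration using only the standard library, not the parser's implementation: the helper names `iter_line_chunks` and `store_in_chunks`, the table layout, and the column names are all assumptions made for this example. It reads GFF lines a fixed number at a time and writes each chunk to SQLite before moving on, so only one chunk is held in memory.

```python
import sqlite3
from itertools import islice


def iter_line_chunks(lines, target_lines):
    """Yield lists of up to target_lines GFF data lines (comments skipped)."""
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    while True:
        chunk = list(islice(data_lines, target_lines))
        if not chunk:
            return
        yield chunk


def store_in_chunks(lines, db_path, target_lines):
    """Write (seqid, type, start, end) rows to SQLite one chunk at a time.

    Hypothetical helper: the table and column names are illustrative only.
    Returns the open connection and the number of chunks processed.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features "
        "(seqid TEXT, ftype TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    n_chunks = 0
    for chunk in iter_line_chunks(lines, target_lines):
        rows = []
        for line in chunk:
            cols = line.rstrip("\n").split("\t")
            rows.append((cols[0], cols[2], int(cols[3]), int(cols[4])))
        conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
        conn.commit()  # this chunk is persisted and can be released
        n_chunks += 1
    return conn, n_chunks
```

Here `lines` can be an open file handle or any other iterable of lines. A fixed line count is only safe for flat, non-nested data such as the read alignments described above.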
The file can still be filtered by feature type, allowing extraction of only features of interest. This example uses the Biopython SeqIO interface to parse an initial FASTA file with our sequences, and then adds coding sequences on chromosome one to it in chunks of 1 million lines:
    from BCBio.GFF.GFFParser import GFFAddingIterator
    from Bio import SeqIO

    with open(seq_file) as seq_handle:
        seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))

    gff_iterator = GFFAddingIterator(seq_dict)
    cds_limit_info = dict(
        gff_types=[('Coding_transcript', 'gene'),
                   ('Coding_transcript', 'mRNA'),
                   ('Coding_transcript', 'CDS')],
        gff_id=['I'],
    )
    for rec_dict in gff_iterator.get_features(gff_file, limit_info=cds_limit_info,
                                              target_lines=1000000):
        for rec in rec_dict.values():
            # deal with rec.features
            pass
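To make the meaning of a filter like `cds_limit_info` concrete, here is a rough standalone sketch of the test such a filter implies for each line. It is an illustration rather than the parser's own code: the function name `line_passes` is hypothetical, and it assumes `gff_types` holds (source, type) pairs matched against columns 2 and 3 and `gff_id` holds sequence names matched against column 1.

```python
def line_passes(line, limit_info):
    """Return True if a tab-delimited GFF line satisfies limit_info.

    Hypothetical helper for illustration only. Keys absent from
    limit_info place no restriction on the line.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:  # not a feature line (the attributes column is optional in GFF2)
        return False
    seqid, source, ftype = cols[0], cols[1], cols[2]
    if "gff_id" in limit_info and seqid not in limit_info["gff_id"]:
        return False
    if "gff_types" in limit_info and (source, ftype) not in limit_info["gff_types"]:
        return False
    return True
```

Checking the cheap leading columns first means most unwanted lines can be discarded before their attributes are ever parsed.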
To avoid missing nested features, the iterator chooses its break points carefully. The file is only split at points where all child features seen so far have their parents, so the features in each chunk can be expected to nest correctly.
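One way such a break rule could work is sketched below. This is a simplified heuristic under stated assumptions, not the parser's actual logic: it assumes GFF3-style `ID` and `Parent` attributes, assumes children follow their parents in the file, and uses the hypothetical names `parse_attrs` and `iter_safe_chunks`. A chunk is closed only when the target size has been reached, every referenced parent has been seen, and the next line starts a new top-level feature.

```python
def parse_attrs(col9):
    """Parse a GFF3-style attribute column such as 'ID=m1;Parent=g1'."""
    attrs = {}
    for part in col9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def iter_safe_chunks(lines, target_lines):
    """Yield chunks of lines, breaking only where nesting is complete."""
    chunk, seen_ids, wanted_parents = [], set(), set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        attrs = parse_attrs(line.rstrip("\n").split("\t")[8])
        starts_new_group = "Parent" not in attrs
        # Break before a new top-level feature, and only if every parent
        # referenced in this chunk has already been seen in this chunk.
        if (starts_new_group and len(chunk) >= target_lines
                and wanted_parents <= seen_ids):
            yield chunk
            chunk, seen_ids, wanted_parents = [], set(), set()
        chunk.append(line)
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
        if "Parent" in attrs:
            wanted_parents.update(attrs["Parent"].split(","))
    if chunk:
        yield chunk
```

With this rule `target_lines` acts as a minimum rather than an exact size: a chunk grows past the target until a safe break point appears.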
Nested features for GFF2
Nesting of features is handled nicely in the newer GFF3 format. However, many sources provide data in the older GFF2 format (the closely related GTF is a GFF2 variant), which has a variety of nesting schemes. The parser's test suite contains examples of these from different online repositories:
- Ensembl GFF2: child features are identified by a transcript_id attribute, and no parent feature is provided
- WormBase GFF2: child features have a Transcript attribute for certain feature types; a parent feature is present, also with a Transcript attribute
- JGI GFF2: child features have a ProteinId attribute and no parent feature
The updated parser handles all of these nesting styles, building a top-level feature for files where the parent is not present. This mimics GFF3 behavior to ease the transition to the newer format. Where parent features are missing, a new feature of type inferred_parent is created that spans the full range of its child features. The child features are available from the sub_features attribute of the parent.
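The idea behind an inferred parent can be shown with plain dictionaries. The sketch below is an assumption-laden illustration rather than the parser's code: features are represented as simple dicts instead of Biopython objects, and `add_inferred_parents` and its `group_key` parameter are hypothetical names. Children sharing the same grouping attribute (for example `transcript_id` in Ensembl files) are collected under a new parent whose coordinates cover all of them.

```python
def add_inferred_parents(children, group_key):
    """Group child features by an attribute and build a parent for each group.

    Hypothetical helper for illustration; each child is a dict with
    'start', 'end' and an 'attrs' dict containing group_key.
    """
    groups = {}
    for child in children:
        groups.setdefault(child["attrs"][group_key], []).append(child)

    parents = []
    for parent_id, subs in groups.items():
        parents.append({
            "id": parent_id,
            "type": "inferred_parent",
            # The parent spans from the first child start to the last child end.
            "start": min(c["start"] for c in subs),
            "end": max(c["end"] for c in subs),
            "sub_features": subs,
        })
    return parents
```

For JGI-style files the same function would be called with `group_key="ProteinId"`.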
These updates improve parsing of older GFF2 files, which are still widely used, and open up parsing of new GFF files produced by SOLiD machines. The code is available from the standard GitHub location. Please continue to pass along bug reports and suggestions.