Blue Collar Bioinformatics

Posts Tagged ‘biopython’

Biopython projects for Google Summer of Code 2010


Google Summer of Code provides a unique opportunity for students to spend a summer working on open source projects and get paid for it. Biopython was involved with two great projects last summer, and it’s time to apply for this year’s program: the student application period runs from next Monday, March 29th to Friday, April 9th, 2010.

If you are a student interested in biology and open source work, there are two community organizations to look at for mentors and project ideas:

  • NESCent Phyloinformatics — NESCent is a GSoC mentoring organization for the 4th year, focusing on projects related to phylogenetics and open source code.
  • Open Bioinformatics Foundation — The umbrella organization that manages BioPerl, Biopython, BioJava, BioRuby and several other popular open source bioinformatics projects is involved with GSoC for the first time.

This year, I’ve collaborated on three project ideas centered on tool integration. An essential programming skill for dealing with large heterogeneous data sets is combining a set of tools in a way that abstracts away the implementation details, letting you focus on the high level biological questions. Bradford Cross, a machine learning and data crunching expert at FlightCaster, describes this process brilliantly in an interview at Data Wrangling.

These three project ideas allow a student to develop essential toolkit integration skills, while having the flexibility to work on biological questions relevant to their undergrad or graduate research:

All involve taking two or more different toolkits and combining their functionality into a higher level interface focused on ease of use. They are intentionally broad and flexible ideas, and a student proposal should concentrate on the functionality most relevant to their biological questions. Ideally the work would be both a publicly available resource and a direct contribution to the student’s daily research.

If you’re interested in these ideas and in working with a set of great mentors, definitely get in touch with me either through the project mailing lists or directly. If none of these ideas strike your fancy but you would like to be involved with GSoC, get in touch with a mentor from one of the other project ideas at NESCent and OpenBio. It’s a unique opportunity to develop new coding skills, work with great mentors, and give back to the open source community.

Written by Brad Chapman

March 26, 2010 at 8:02 am

Posted in OpenBio


Python query interface to BioGateway SPARQL endpoint and InterMine


I spent the last week in Tokyo, Japan at BioHackathon 2010, an extremely productive meet up of open source coders organized by Toshiaki Katayama and the great folks at the Database Center for Life Science (DBCLS). The focus of the week was improving biological toolkits for accessing the Semantic Web of linked data.

The technical focus was on RDF (Resource Description Framework), a standard way to represent data as triples; each triple is made up of a subject defining the item, a predicate specifying a relationship, and an object representing the linked data. By providing data in RDF along with shared naming schemes for common objects, we facilitate linking biological data in mashups, expanding the ability of researchers to discover relationships between disparate data sources.

RDF is managed in data stores like Virtuoso, which are equivalent to relational or document based databases. For programmers, the primary technology for querying these stores is SPARQL, a query language similar to SQL. The goal of the Biopython programming team at the Hackathon was to provide an easy to use Python library to query RDF stores through SPARQL.
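
For readers who have not seen raw SPARQL, here is a minimal sketch of a direct query using the SPARQLwrapper library we build on; the endpoint URL and predicate URI are placeholders rather than the actual BioGateway vocabulary:

# Minimal SPARQL query sketch with SPARQLwrapper. The endpoint URL and the
# predicate URI are illustrative placeholders, not the real BioGateway schema.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setQuery("""
    SELECT ?protein ?go_term WHERE {
        ?protein <http://example.org/hasGOTerm> ?go_term .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print binding["protein"]["value"], binding["go_term"]["value"]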

Interface design

The Python interface organizes common biological queries on a datastore without exposing the backend implementation details. The final interface reduces the process of making a query to two steps:

  • Instantiate a query builder, providing it with two sets of data: attributes to retrieve and items to filter against. This is modeled after BioMart querying and the R biomaRt interface, providing a generic, well understood way to specify the query.
  • Pass the query builder to a retrieval server which executes the query and returns the results in a tabular format, as a numpy RecordArray.

The user is never exposed to the underlying implementation details, as the library performs the work of building the query, submitting it to the remote server and reformatting the results.

Two interfaces were implemented at BioHackathon 2010:

  • BioGateway — a SPARQL endpoint to an RDF data store containing SwissProt data semantically linked to Gene Ontology (GO) terms.
  • InterMine — an XML-based interface to a traditional relational database backend, containing organized metadata for primary experimental data from many organisms.

By providing a common interface for both semantic data and more traditional data sources, we hope to ease data providers’ transition to RDF where it simplifies their backend storage and queries. Users of this high level interface do not need to worry about the underlying implementation, instead focusing their resources on developing their biological queries.

BioGateway

BioGateway organizes the SwissProt database of protein sequences along with Gene Ontology Annotations (GOA) into an integrated RDF database. Data access is provided through a SPARQL query endpoint, allowing searches for proteins based on a combination of GO and SwissProt data.

This query searches for proteins that are involved in insulin response and linked to diabetes. The protein name, other proteins associated via protein-protein interaction, and the gene name are retrieved:

from systemsbio import Biogateway, UniProtGOQueryBuilder
 
builder = UniProtGOQueryBuilder("Homo sapiens")
builder.add_filter("GO_term", "insulin")
builder.add_filter("disease_description", "diabetes")
builder.add_attributes(["protein_name", "interactor", "gene_name"])
        
server = Biogateway()
results = server.search(builder)
print len(results), results.dtype.names
result = results[0]
print result['protein_name'], result['gene_name'], \
      result['interactor'], result['GO_term']

An orthogonal search approach is to start with a protein of interest and retrieve linked details. Here we identify primary journal papers about a protein:

from systemsbio import Biogateway, ReferenceBuilder
 
builder = ReferenceBuilder()
builder.add_filter("protein_id", "1433B_HUMAN")
builder.add_attributes(["reference"])
 
server = Biogateway()
results = server.search(builder)
print len(results), results.dtype.names
result = results[0]
print result['protein_id'], result['reference']

Using the associated PubMed IDs, we can retrieve the paper details using Biopython and NCBI Entrez, utilizing BioGateway as part of an automated analysis pipeline:

from Bio import Entrez
Entrez.email = "me@example.com"
pubmed_id = result['reference'].replace("PMID_", "")
handle = Entrez.esummary(db="pubmed", id=pubmed_id)
record = Entrez.read(handle)[0]
print record['Title']
print record['PubDate']
print ",".join(record['AuthorList'])
print record['FullJournalName'], record['Volume'], record['Pages']
# Novel raf kinase protein-protein interactions found by an exhaustive yeast two-hybrid analysis.
# 2003 Feb
# Yuryev A,Wennogle LP
# Genomics 81 112-25

The full source code is available from GitHub: systemsbio.py. The implementation builds a SPARQL query based on the provided attributes and filters, using SPARQLwrapper to interact with the remote server and parse the results.
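
As a rough sketch of the general approach (the real systemsbio.py code is more involved, and the predicate URIs below are placeholders), a builder can assemble the SELECT and WHERE clauses directly from the attribute and filter lists:

# Simplified sketch of turning attributes and filters into a SPARQL string.
# The predicate URIs are placeholders; systemsbio.py maps to real vocabularies.
def build_sparql(attributes, filters, predicate_map):
    select = " ".join("?%s" % a for a in attributes)
    variables = list(attributes) + [f for f in filters if f not in attributes]
    patterns = ["?item %s ?%s ." % (predicate_map[v], v) for v in variables]
    constraints = ['FILTER regex(?%s, "%s", "i")' % (v, val)
                   for v, val in filters.items()]
    return "SELECT %s WHERE { %s %s }" % (select, " ".join(patterns),
                                          " ".join(constraints))

query = build_sparql(
    ["protein_name", "gene_name"],
    {"disease_description": "diabetes"},
    {"protein_name": "<http://example.org/proteinName>",
     "gene_name": "<http://example.org/geneName>",
     "disease_description": "<http://example.org/diseaseDescription>"})
print query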

InterMine

InterMine is an open source data management system used to power databases of primary research results like FlyMine and modMine. It stores metadata associated with projects in a structured way, enabling searches to identify data submissions of interest to biologists. It contains two useful web based tools to facilitate these searches:

  • Templates — Pre-defined queries that capture common ways biologists search the database.

  • Query builder — A graphical interface to define custom queries, allowing manual discovery of attributes of interest.

We access InterMine programmatically using the same builder and server paradigms used in our BioGateway interface. The query below searches modMine for C. elegans experiments characterizing the H3K4Me3 histone modification, which is associated with chromatin structure in active genes. The returned submission identifiers can be used to examine the primary data associated with the experiment:

from intermine import Intermine, SubmissionQueryBuilder
builder = SubmissionQueryBuilder()
builder.add_attributes(["submission_id",
    "submission_title", "developmental_stage"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.add_filter("antibody_name", "H3K4me3")
 
server = Intermine("http://intermine.modencode.org")
table = server.search(builder)
print table.dtype.names
print table
# ('submission_id', 'submission_title', 'developmental_stage')
# [('176', 'Histone_H3K4me3_from_N2_L3_larvae_(AR0169_H3K4me3_N2_L3)', '')
# ('2311', 'Histone_H3K4me3_N2_EEMB_(WA30534819_H3K4ME3_N2_EEMB)',
# 'Early Stage Embryos')
# ('2410', 'Histone_H3K79me1_N2_EEMB_(AB2886_H3K79ME1361912_N2_EEMB)',
# 'Early Stage Embryos')]

An advantage of defining query builders is that we can provide custom functionality to access more complex queries. The code below searches for C. elegans ChIP-seq experiments using a free text search. The implementation searches for the query term against several description fields in the database, hiding these details from the user:

from intermine import Intermine, ExperimentQueryBuilder
builder = ExperimentQueryBuilder()
builder.add_attributes(["submission_id", "experiment_name"])
builder.add_filter("organism", "Caenorhabditis elegans")
builder.free_text_filter("ChIP-seq")
    
server = Intermine("http://intermine.modencode.org")
table = server.search(builder)
print table.dtype.names
print table
# ('submission_id', 'experiment_name')
# [('582', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ('584', 'ChIP-Seq Identification of C. elegans TF Binding Sites')
# ...

The full source code is available from GitHub: intermine.py. It builds an XML query for submission to InterMine Web Services, handling submission and parsing of the results.
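
To give a flavor of the underlying request (treat the service path, model name and view/constraint paths below as assumptions; the exact values depend on the InterMine instance), an XML path query can be posted directly to the web service:

# Rough sketch of posting an XML path query to an InterMine web service. The
# service URL, model name and view/constraint paths are assumed for the example.
import urllib

query_xml = """<query model="genomic" view="Submission.DCCid Submission.title">
  <constraint path="Submission.organism.name" op="=" value="Caenorhabditis elegans"/>
</query>"""
params = urllib.urlencode({"query": query_xml, "format": "tab"})
url = "http://intermine.modencode.org/query/service/query/results"
for line in urllib.urlopen(url, params):
    print line.rstrip()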

Thoughts

It is not a coincidence that a diverse set of tools like InterMine, BioGateway and BioMart were used in building these interfaces. The collaborative environment at BioHackathon 2010 facilitated productive discussions with the authors of these projects, leading to the API development and implementation. If you are interested in more details about the week of programming, check out the day-to-day summaries:

You are invited to fork and extend the code on GitHub.

Written by Brad Chapman

February 15, 2010 at 11:33 am

Automated retrieval of expression data with Python and R


In February, I’ll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I’ve been thinking about use cases: when working with a set of data, what other information do you want to be able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you’d know what 7.65 meant in the context of the experiment.
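
Purely as an illustration, such a query could be written as SPARQL-style triple patterns; every name below is invented to mirror the bullet points above, not taken from a real vocabulary:

# Hypothetical triple patterns mirroring the idealized query; the predicates
# and resource names are invented for illustration only.
ideal_query = """
SELECT ?refgene ?value WHERE {
    ?exp_set      :has_organism         :Mus_musculus .
    :Mus_musculus :has_cell_type        :proB .
    :proB         :has_biological_state "wild type" .
    ?exp_set      :has_identifier       ?refgene .
    ?refgene      :has_expression_value ?value .
}
"""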

Although we can’t do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI’s Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query its data sets with the EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor’s GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 Python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

import os

def main():
    organism = "Mus musculus"
    cell_types = ["proB", "ProB", "pro-B"]
    email = "chapmanb@50mail.com"
    save_dir = os.getcwd()
    exp_data = get_geo_data(organism, cell_types, email, save_dir,
        _is_wild_type)

def _is_wild_type(result):
    """Check if a sample is wild type from the title.
    """
    return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

import os
import cPickle

def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
    save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
    if not os.path.exists(save_file):
        results = cell_type_gsms(organism, cell_types, email)
        for result in results:
            if is_desired_result(result):
                with open(save_file, "w") as out_handle:
                    cPickle.dump(result, out_handle)
                break

    with open(save_file) as save_handle:
        result = cPickle.load(save_handle)
    return result

The hard work of querying GEO and retrieving the results is done by Biopython’s Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

from Bio import Entrez

def cell_type_gsms(organism, cell_types, email):
    """Use Entrez to retrieve GEO entries for an organism and cell type.
    """
    Entrez.email = email
    search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
    print "Searching GEO and retrieving results: %s" % search_term
    
    hits = []
    handle = Entrez.esearch(db="gds", term=search_term)
    results = Entrez.read(handle)
    for geo_id in results['IdList']:
        handle = Entrez.esummary(db="gds", id=geo_id)
        summary = Entrez.read(handle)
        samples = []
        for sample in summary[0]['Samples']:
            for cell_type in cell_types:
                if sample['Title'].find(cell_type) >= 0:
                    samples.append((sample['Title'], sample['Accession']))
                    break
        if len(samples) > 0:
            hits.append(GEOResult(summary[0]['summary'], samples))
    return hits

Once the experiment is selected we’ve accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

WT ProB, biological rep1 
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]), 
 ('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
 ('NM_177325', [7.3312364040000002, 11.831475960000001]),
 ('NM_177323', [6.9779868059999997, 6.3926399939999996]),
 ('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, along with high level meta-data about the experiment, is stored in local files. The expression information is stored as an R-style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

def _write_gsm_map(self, gsm_id, meta_file, table_file):
    """Retrieve GEO expression values using Bioconductor, saving to a table.
    """
    robjects.r.assign("gsm.id", gsm_id)
    robjects.r.assign("table.file", table_file)
    robjects.r.assign("meta.file", meta_file)
    robjects.r('''
      library(GEOquery)
      library(rjson)
      gsm <- getGEO(gsm.id)
      write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
      cat(toJSON(Meta(gsm)), file = meta.file)
    ''')

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented as an R data frame, which we limit to only the items of interest and save as a tab delimited file:

def _write_gpl_map(self, gpl_id, gpl_file):
    """Retrieve GEO platform data using R and save to a table.
    """
    robjects.r.assign("gpl.id", gpl_id)
    robjects.r.assign("gpl.file", gpl_file)
    robjects.r('''
      library(GEOquery)
      gpl <- getGEO(gpl.id)
      gpl.map <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
      write.table(gpl.map, file = gpl.file, sep = "\t", row.names = FALSE,
                  col.names = TRUE)
    ''')

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

def get_gsm_tx_values(self, gsm_id, save_dir):
    """Retrieve a map of transcripts to expression from a GEO GSM file.
    """
    gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
    gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
    if (not os.path.exists(gsm_meta_file) or 
            not os.path.exists(gsm_table_file)):
        self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

    with open(gsm_meta_file) as in_handle:
        gsm_meta = json.load(in_handle)
    id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
    tx_to_vals = collections.defaultdict(list)
    with open(gsm_table_file) as in_handle:
        reader = csv.reader(in_handle, dialect='excel-tab')
        reader.next() # header
        for probe_id, probe_val in reader:
            for tx_id in id_to_tx.get(probe_id, []):
                tx_to_vals[tx_id].append(float(probe_val))
    return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous than the idealized one described at the start of the post. Simplifying this type of question-based approach to data access is one challenge bioinformatics tool developers will continue to face.

Written by Brad Chapman

January 2, 2010 at 5:43 pm

Trimming adaptors from short read sequences


One common post-processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small, as with short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those that differentiate DNA sequences.
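
As a quick illustration of that first attempt, difflib from the standard library gives similarity scores like the following; this is only to show the idea, not code from the final solution:

# Fuzzy string matching of an adaptor against a read with the standard library.
# Similarity ratios like this did not discriminate well between DNA sequences.
import difflib

adaptor = "TCGTATGCCGTCTTCTGC"
read = "ACGTTCGTATGCAGTCTTCTGCTTGAA"  # contains the adaptor with one mismatch
matcher = difflib.SequenceMatcher(None, adaptor, read)
print matcher.ratio()
match = matcher.find_longest_match(0, len(adaptor), 0, len(read))
print read[match.b:match.b + match.size]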

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle the common case where we can find the exact adaptor sequence, we first do a simple string match. This avoids the alignment overhead in many cases:

from Bio import pairwise2

def trim_adaptor(seq, adaptor, num_errors, right_side=True):
    gap_char = '-'
    exact_pos = str(seq).find(adaptor)
    if exact_pos >= 0:
        seq_region = str(seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
                str(adaptor), 5.0, -4.0, -9.0, -0.5,
                one_alignment_only=True, gap_char=gap_char)[0]
        adapt_region = adaptor_a[start:end]
        seq_region = seq_a[start:end]
    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
            enumerate(seq_region))
    # too many errors -- no trimming
    if (len(adaptor) - matches) > num_errors:
        return seq
    # remove the adaptor sequence and return the result
    else:
        return _remove_adaptor(seq, seq_region.replace(gap_char, ""),
                right_side)

If the alignment contains no more than the specified number of errors we’ve found an adaptor and remove it, returning the trimmed sequence. The right_side argument allows us to specify whether the trimming should be done on the right (3′ end) or the left (5′ end):

def _remove_adaptor(seq, region, right_side=True):
    """Remove an adaptor region and all sequence to the right or left.
    """
    if right_side:
        try:
            pos = seq.find(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.find(region)
        return seq[:pos]
    else:
        try:
            pos = seq.rfind(region)
        # handle Biopython SeqRecords
        except AttributeError:
            pos = seq.seq.rfind(region)
        return seq[pos+len(region):]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file. Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

from __future__ import with_statement
import sys
import os

from Bio import SeqIO

from adaptor_trim import trim_adaptor

def main(in_file):
    adaptor_seq = "TCGTATGCCGTCTTCTGC"
    num_errors = 2
    base, ext = os.path.splitext(in_file)
    out_file = "%s-trimmed.fa" % (base)
   
    with open(in_file) as in_handle:
        with open(out_file, "w") as out_handle:
            for rec in SeqIO.parse(in_handle, "fastq"):
                trim = trim_adaptor(rec.seq, adaptor_seq, num_errors)
                if len(trim) > 0 and len(trim) < len(rec):
                    rec.letter_annotations = {}
                    rec.seq = trim
                    SeqIO.write([rec], out_handle, "fasta")

if __name__ == "__main__":
    main(sys.argv[1])

You can get the full trimming code, along with tests, from GitHub. The script takes about a half hour to process 15 million 36bp reads. Other current trimming approaches are often integrated into the aligners themselves: for instance, NovoAlign appears to have similar alignment based trimming capabilities.

Written by Brad Chapman

August 9, 2009 at 8:18 pm

Posted in OpenBio


Biopython projects for Google Summer of Code


Google Summer of Code is a brilliant program that provides a summer stipend for students to work on open source projects. This year, I have been evaluating Biopython student proposals under the watchful eye of The National Evolutionary Synthesis Center (NESCent). Biopython was lucky enough to have interest from several students, resulting in some excellent proposals. Two of these were selected for funding:

  • Eric Talevich’s proposal on adding support to Biopython for PhyloXML, an XML format for representing phylogenetic trees and associated data. I will be the primary mentor for this project, with Christian Zmasek, the author of PhyloXML, providing plenty of capable secondary mentoring.
  • Nick Matzke’s proposal on Biogeographical Phylogenetics. This is an integrated project that involves extracting biodiversity data from the web, merging it with phylogenetic data, and providing analysis and display libraries. Nick helped assemble a full team of mentors, and I will be providing input on integrating his code with Biopython while learning about Biogeography along the way.

This is my first year working with Summer of Code, and I could not be more impressed by the professionalism and hard work of the organizers at both Google and NESCent. The program is designed from the ground up to be simultaneously inclusive and rigorously selective. Students are asked to prepare detailed project plans to demonstrate they understand the subject, are able programmers, and can communicate effectively. There is also a strong eye given to students who are likely to continue on in their open source communities after the summer; after all, open source work is really built on the free labor of many dedicated people.

Having worked on open source projects for several years, it is invigorating and heartening to see a program designed to encourage the next generation of open source leaders. My time on the program thus far has been a great learning experience, and I am looking forward to a productive summer.

Written by Brad Chapman

April 20, 2009 at 5:38 pm

Posted in OpenBio


Examining and adjusting your GFF file


Generic Feature Format (GFF) defines a standard template for representing biological features. Within this template, however, is room for flexibility. Different GFF producers may decide to format their data in slightly different ways. While parsing these files is not a problem, correctly interpreting and utilizing the data can be. The in-development Biopython GFF parser provides utilities to get a high level summary of the elements of a GFF file, and to adjust line items in the file during parsing.

Examining a GFF file

When downloading a new GFF file, the first step is getting an overview of the file contents. The GFF parser provides a GFFExaminer class to help with this. The first function of interest is available_limits:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)
pprint.pprint(possible_limits)

It returns a dictionary defining various ways to limit your parsing of the GFF file, along with the count of each item. As an example, here is a trimmed dump from one of the test files. This has features on two different record ids — chromosomes ‘I’ and ‘X’, with 159 and 6 items, respectively. Also listed are the combination of the 2nd source column and 3rd type column, and the type column by itself:

{'gff_id': {('I',): 159,
            ('X',): 6},
 'gff_source_type': {('Coding_transcript', 'CDS'): 27,
                     ('Coding_transcript', 'exon'): 33,
                     ('Coding_transcript', 'five_prime_UTR'): 4,
                     ('Coding_transcript', 'gene'): 2,
                     ('Coding_transcript', 'intron'): 29,
                     ('Coding_transcript', 'mRNA'): 4,
                     ('Coding_transcript', 'three_prime_UTR'): 3,
                     ('mass_spec_genome', 'translated_nucleotide_match'): 7},
 'gff_type': {('CDS',): 57,
              ('exon',): 33,
              ('five_prime_UTR',): 4,
              ('gene',): 2,
              ('intron',): 29,
              ('mRNA',): 4,
              ('three_prime_UTR',): 3,
              ('translated_nucleotide_match',): 7}}

In addition to the overview of file contents, nested relationships are another important component of the file to understand. A summary of these is available through the parent_child_map function:

import pprint
from BCBio.GFF.GFFParser import GFFExaminer
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)
pprint.pprint(pc_map)

Again, a dictionary is returned. The keys in the dictionary are parent source and type elements in the file, while the values are children of those elements. For instance, here is the dictionary for a three tiered relationship where genes have mRNAs, and each mRNA can have coding regions, exons, introns, and 5′ and 3′ untranslated regions:

{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}

Adjusting GFF lines during parsing

Occasionally you may run into a file format that you can comprehend, but that does not match your expectations for where items should be. SOLiD GFF files are one example; many thanks are due to David Schruth, who has been patiently working with me on the parsing of these files. They have the read name in the first column, which is normally used for the record ID the feature maps to. The actual record ID is contained as an attribute, i=1, where the 1 corresponds to the index of the record it maps to in the original FASTA alignment file:

3_336_815_F3    solid   read    55409   55428   10.4    +       .       i=1

The GFFAddingIterator takes an optional line_adjust_fn argument, which can be used to solve this problem. The function is called each time a line is read and passed a parsed dictionary representing the line. The dictionary for the above line is:

{'id': '',
 'is_gff2': False,
 'location': [55408, 55428],
 'quals': {
           'i': ['1'],
           'score': ['10.4'],
           'source': ['solid']},
 'rec_id': '3_336_815_F3',
 'strand': 1,
 'type': 'read'}

The function takes this item as an argument and returns the dictionary, but with any adjustments that need to be made. In our example, we look up the name of the record corresponding to the i=1 index. This name is swapped in for the rec_id, while that information moves to an attribute named read_name:

from Bio import SeqIO
from BCBio.GFF.GFFParser import GFFAddingIterator

class SolidFastaRemap:
    def __init__(self, initial_fasta):
        self._in_map = self._get_index_map(initial_fasta)

    def _get_index_map(self, in_file):
        in_map = dict()
        in_handle = open(in_file)
        for index, rec in enumerate(SeqIO.parse(in_handle, "fasta")):
            in_map[index] = rec.id
        in_handle.close()
        return in_map

    def adjust_fn(self, results):
        # 1-based indexes; convert to 0-based
        rec_index = int(results['quals']['i'][0]) - 1
        read_name = results['rec_id']
        results['quals']['read_name'] = [read_name]
        results['rec_id'] = self._in_map[rec_index]
        return results

remapper = SolidFastaRemap(fasta_file)
gff_iterator = GFFAddingIterator(line_adjust_fn=remapper.adjust_fn)
rec_dict = gff_iterator.get_all_features(gff_file)

This allows you to fix the file during the parsing, saving multiple passes through the file. This general functionality can be used to cleanly deal with any other inconsistencies that crop up during your GFF parsing adventures.

Written by Brad Chapman

April 12, 2009 at 8:55 pm

Posted in OpenBio


More python GFF parsing — iterative parsing and GFF2 nested features


Work on the Python generic feature format (GFF) parser continues to push forward; many thanks to those who have provided feedback in helping to refine the functionality. Previously, we discussed the initial implementation, introduced MapReduce parsing for parallelization, and discussed deploying on a cluster and GFF2 parsing. This week's post describes an interface for iterator based parsing of GFF files and nested features for GFF2 files.

Iterative parsing of GFF

GFF files are line-based and related features can be located anywhere in the file. To guarantee all features are parsed and combined correctly, the entire file needs to be scanned and loaded. For large files, we need strategies to load the data without exhausting available memory.

In some cases, it is known that parsing the file in chunks will not result in any information being lost. GFF files produced by SOLiD for short read alignments are one common case. These are read based, non-nested, and quite large. To tackle these files, the parser now has an iterator based interface that can be used to iterate over sections of the file:

from BCBio.GFF.GFFParser import GFFAddingIterator

gff_iterator = GFFAddingIterator()
for rec_dict in gff_iterator.get_features(gff_file, target_lines=3000000):
    for rec in rec_dict.values():
        pass  # deal with rec.features

This parses a file into ~350MB sized pieces, returning a dictionary of Biopython SeqRecord objects keyed by their names. Each SeqRecord contains all of the features added from that chunk of the file. These can be persisted to a database or otherwise analyzed before proceeding on to the next chunk, keeping memory requirements more reasonable.
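
As one hedged sketch of that workflow, each chunk's features could be appended to a local shelve database keyed by record id before the next chunk is read; this is an illustration of the idea, not part of the parser itself:

# Sketch: persist the features from each parsed chunk to a shelve database so
# memory use stays bounded. Illustrative only; not part of the GFF parser.
import shelve

from BCBio.GFF.GFFParser import GFFAddingIterator

db = shelve.open("gff_features.db")
gff_iterator = GFFAddingIterator()
for rec_dict in gff_iterator.get_features(gff_file, target_lines=3000000):
    for rec_id, rec in rec_dict.items():
        stored = db.get(rec_id, [])
        stored.extend(rec.features)
        db[rec_id] = stored
db.close()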

The file can still be filtered by feature type, allowing extraction of only features of interest. This example uses the Biopython SeqIO interface to parse an initial FASTA file with our sequences, and then adds coding sequences on chromosome one to it in chunks of 1 million lines:

from BCBio.GFF.GFFParser import GFFAddingIterator
from Bio import SeqIO

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)
cds_limit_info = dict(
        gff_types = [('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
for rec_dict in gff_iterator.get_features(gff_file,
        limit_info=cds_limit_info, target_lines=1000000):
    for rec in rec_dict.values():
        pass  # deal with rec.features

To avoid missing nested features, the iterator makes smart decisions about when to break the file. It is broken at points where all child features have their parents and can be expected to be nested correctly.

Nested features for GFF2

Nesting of features is handled nicely in the new GFF3 format. However, many sources provide information in the older GFF2 (also called GTF) format, which has a variety of nesting schemes. The test examples for the parser contain some examples of these from different online repositories:

  • Ensembl GFF2 — recognizes child features by a transcript_id attribute, and does not provide a parent feature
  • WormBase GFF2 — child features have a Transcript attribute for certain feature types; a parent feature is present, also with a Transcript attribute
  • JGI GFF2 — child features have a TranscriptId or ProteinId attribute and no parent feature

The updated parser handles all these styles of nesting, building a top level feature for those files where this parent is not present. This mimics the new GFF3 behavior to ease the transition to those files. Where parent features are missing, a new feature is created of type inferred_parent which spans the distance of all child features. These child features are available from the sub_features attribute of the parent.
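
Here is a short sketch of examining those inferred parents after parsing; it only uses the feature type and sub_features attribute described above:

# Sketch: report inferred parent features created for GFF2 files that lack an
# explicit parent, along with the number of child features nested under them.
from BCBio.GFF.GFFParser import GFFAddingIterator

gff_iterator = GFFAddingIterator()
rec_dict = gff_iterator.get_all_features(gff_file)
for rec in rec_dict.values():
    for feature in rec.features:
        if feature.type == "inferred_parent":
            print feature.location, len(feature.sub_features)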

These new updates improve parsing for older GFF2 files which are still widely used, and open up parsing of new GFF files produced by SOLiD machines. The code is available from the standard GitHub location. Please continue to pass along bug reports and suggestions.

Written by Brad Chapman

April 5, 2009 at 7:49 pm

Posted in OpenBio

