Blue Collar Bioinformatics

Note: new posts have moved to a new home; please look there for the latest updates and comments.

Posts Tagged ‘screen_scraping’

Location and duplication information from Ensembl


Understanding the evolutionary history of a gene family may provide some insight into the mechanisms of its activity. Here we will look at two characteristics for a list of eukaryotic genes of interest:

  • Groups that are co-located on a chromosome.
  • Groups evolutionarily related through duplication (paralogs).

By expanding on work done in an earlier post, we will identify co-located and duplicated genes using information from the Ensembl genome browser.

Using our previous method of screen scraping with Beautiful Soup, we extend our Ensembl REST-like class to include two new functions. Location is represented as a list of chromosome name, start and end, and parsed from the Gene Summary page:

def location(self, organism, gene_id):
    with self._get_open_handle("Gene", "Summary",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        loc_tab = soup.find("dd", id="tab_location")
        link = loc_tab.find("a")
        path, attrs = urllib2.splitattr(link["href"])
        for attr in attrs:
            if attr.find("r=") == 0:
                key, val = attr.split("=")
                chrom, location = val.split(":")
                start, end = location.split("-")
                return chrom, int(start), int(end)
        raise ValueError("Did not find location: %s" % link)
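The href parsing above hinges on the `r=` attribute Ensembl uses to encode region coordinates. `urllib2.splitattr` is Python 2 only; as a sketch, the same split can be done in modern Python with `urllib.parse` (the example href here is made up, following the shape of the Gene Summary links):

```python
from urllib.parse import urlparse, parse_qs

def parse_region(href):
    """Pull (chromosome, start, end) out of an Ensembl-style 'r=' query."""
    query = parse_qs(urlparse(href).query)
    chrom, span = query["r"][0].split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

# A made-up href of the kind linked from a Gene Summary page:
print(parse_region("/Homo_sapiens/Location/View?r=6:133017695-133161157"))
# -> ('6', 133017695, 133161157)
```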

Similarly, we can retrieve the details on paralogs from the Ensembl Comparative pages for the genes. With these two functions, a list of the protein IDs of interest, and a dictionary containing species and Ensembl ID references, we collect all of the location and duplication information into Python dictionaries. At the same time we maintain a backwards mapping of Ensembl IDs to our original IDs:

ensembl_retriever = EnsemblComparaRest(cache_dir)
loc_info = dict()
dup_info = dict()
ensembl_to_uniprot = dict()
for cur_id in all_ids:
    cur_rec = db[cur_id]
    cur_ensembl_org = cur_rec["org_scientific_name"].replace(" ", "_")
    for ensembl_id in cur_rec.get("db_refs_ensembl", []):
        paralogs = ensembl_retriever.paralogs(cur_ensembl_org, ensembl_id)
        chromosome, start, end = ensembl_retriever.location(
                cur_ensembl_org, ensembl_id)
        dup_info[ensembl_id] = paralogs
        loc_info[ensembl_id] = (cur_rec["org_scientific_name"],
            chromosome, start, end)
        ensembl_to_uniprot[ensembl_id] = cur_id

Now we want to flatten the paralog dictionary into a list of groups associated by duplication. This is done using Python sets; any groups with shared genes are combined, and the resulting unique list is returned:

def examine_paralogs(dup_info, ensembl_to_uniprot):
    cur_groups = []
    all_base = dup_info.keys()
    for base_id, dup_ids in dup_info.items():
        overlap = set(dup_ids) & set(all_base)
        if len(overlap) > 0:
            new_group = set([ensembl_to_uniprot[x] for x in
                             overlap | set([base_id])])
            is_unique = True
            for exist_i, exist_group in enumerate(cur_groups):
                if len(new_group & exist_group) > 0:
                    update_group = new_group | exist_group
                    cur_groups[exist_i] = update_group
                    is_unique = False
            if is_unique:
                cur_groups.append(new_group)
    return [list(g) for g in cur_groups]
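The combining step can be seen on a toy input. This standalone sketch (with hypothetical IDs, not real UniProt accessions) merges any sets that share members until the result is fully disjoint:

```python
def merge_overlapping(groups):
    """Combine any sets sharing members until all groups are disjoint."""
    merged = []
    for group in groups:
        group = set(group)
        keep = []
        for existing in merged:
            if group & existing:
                group |= existing  # absorb any overlapping group
            else:
                keep.append(existing)
        merged = keep + [group]
    return merged

# Hypothetical IDs: idA/idB and idB/idC share idB, so they collapse
# into one group; idX stays on its own.
print(merge_overlapping([{"idA", "idB"}, {"idB", "idC"}, {"idX"}]))
```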

We use two functions to similarly group together genes by location. The first function calculates the actual distance given the previously retrieved location information list of (organism, chromosome, start, end). Two items in different organisms or on different chromosomes are placed maximally far apart since they can’t be co-located.

def location_distance(loc_one, loc_two):
    if loc_one[:2] != loc_two[:2] or loc_one == loc_two:
        return sys.maxint
    return max(abs(loc_one[3] - loc_two[2]),
               abs(loc_two[3] - loc_one[2]))
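To see the distance logic in action, here is a self-contained Python 3 restatement (`float("inf")` replaces `sys.maxint`, which no longer exists) exercised on made-up coordinates:

```python
def location_distance(loc_one, loc_two):
    # loc = (organism, chromosome, start, end); a pair in different
    # organisms or on different chromosomes can never be co-located.
    if loc_one[:2] != loc_two[:2] or loc_one == loc_two:
        return float("inf")
    return max(abs(loc_one[3] - loc_two[2]),
               abs(loc_two[3] - loc_one[2]))

gene_a = ("Homo sapiens", "6", 1000, 2000)
gene_b = ("Homo sapiens", "6", 3000, 4000)
gene_c = ("Homo sapiens", "7", 1000, 2000)
print(location_distance(gene_a, gene_b))  # 3000
print(location_distance(gene_a, gene_c))  # inf
```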

The next function creates a dictionary of genes co-located based on our threshold, and then uses the examine_paralogs flattening function we defined above to generate groups from this dictionary. The variable distance_thresh defines the distance in which two genes are considered co-located. For the example script, 1 megabase is used, but this can be adjusted according to your own personal definition of close.

def examine_location(loc_info, ensembl_to_uniprot, distance_thresh):
    loc_close = collections.defaultdict(lambda: [])
    for uniprot_id, loc_one in loc_info.items():
        for cmp_id, loc_two in loc_info.items():
            if location_distance(loc_one, loc_two) <= distance_thresh:
                loc_close[uniprot_id].append(cmp_id)
    return examine_paralogs(loc_close, ensembl_to_uniprot)

The full script takes as input a file with a UniProt ID and gene name on each line, and also requires a UniProt shelve database like the one we developed earlier. This shelve database provides the base ID to Ensembl ID mappings and organism information, which can be parsed from UniProt XML files.
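The shape of the shelve records the loop above relies on can be sketched with a toy entry (the IDs here are placeholders; real entries come from parsed UniProt XML):

```python
import os
import shelve
import tempfile

# Build a toy UniProt shelve with the two keys the script uses;
# "EXAMPLE_UNIPROT_ID" and "ENSG_EXAMPLE" are hypothetical IDs.
db_path = os.path.join(tempfile.mkdtemp(), "uniprot_demo")
db = shelve.open(db_path)
db["EXAMPLE_UNIPROT_ID"] = {
    "org_scientific_name": "Homo sapiens",
    "db_refs_ensembl": ["ENSG_EXAMPLE"],
}
db.close()

# Read it back the way the main loop does, including the underscore
# form of the organism name that Ensembl URLs expect.
db = shelve.open(db_path)
rec = db["EXAMPLE_UNIPROT_ID"]
org = rec["org_scientific_name"].replace(" ", "_")
print(org, rec["db_refs_ensembl"])  # Homo_sapiens ['ENSG_EXAMPLE']
db.close()
```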

The resulting co-location and duplication groups are sets of genes which may share an intriguing evolutionary history. The paralogs have been computationally determined to be evolutionarily related based on sequence similarity and the composition of their gene family tree. The co-located genes may be co-selected during evolution, be the result of localized gene duplication obscured by extensive sequence change, or simply be close together based on chance. Further examination of your gene family in light of this information can help determine which of these hypotheses to favor.

Written by Brad Chapman

January 31, 2009 at 3:11 pm

Comparative genomics information retrieval from Ensembl


The Ensembl website provides a powerful front end to genomic data from a wide variety of eukaryotic species. Additionally, the Comparative Genomics initiative provides automated phylogenetic analyses for comprehensive examination of gene families. This post describes retrieving comparative protein details for a human gene of interest using the tried and true method of screen scraping, presenting the data in a form ready to be served via a REST interface.

Several other people have had similar notions for retrieving Ensembl data. Pedro describes an example using openKapow, and Andrew uses Dapper.

Here, we deal with the Ensembl web pages using Beautiful Soup, a Python HTML parser that simplifies the work of pulling data out of retrieved pages. The idea is to generate an interface that could be readily abstracted to a set of REST web pages. This would greatly simplify information retrieval from Ensembl; retrieving a set of orthologs for a gene ID would involve a workflow like:

  • Prepare a URL like:
  • Parse out a simple text result in CSV:

For queries that can be expressed with a few inputs and readily understandable outputs, this would provide programmatic access to Ensembl data without the overhead of installing the Perl API. Below is a function which retrieves the pages, parses them with Beautiful Soup, and returns the simplified information. Wrapping this into the REST interface described above would require adding a layer on top using a Python web framework like Pylons.
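To make the "layer on top" concrete, here is a minimal sketch using the standard-library WSGI interface rather than Pylons; the URL scheme, parameter names, and canned rows are all made up for illustration, and a real version would call the scraping class instead:

```python
from urllib.parse import parse_qs

def orthologs_app(environ, start_response):
    """Toy WSGI endpoint: GET /orthologs?organism=...&gene=... -> CSV."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    organism = params.get("organism", ["unknown"])[0]
    gene_id = params.get("gene", ["unknown"])[0]
    # Canned, made-up rows stand in for the scraped orthologs() result.
    rows = [(organism, gene_id), ("Other_species", "ENSG_EXAMPLE")]
    body = "\n".join("%s,%s" % row for row in rows)
    start_response("200 OK", [("Content-Type", "text/csv")])
    return [body.encode("utf-8")]

# Exercise the app without running a server by faking the WSGI call:
statuses = []
environ = {"QUERY_STRING": "organism=Homo_sapiens&gene=ENSG00000173894"}
result = orthologs_app(environ, lambda status, headers: statuses.append(status))
print(result[0].decode("utf-8"))
```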

def orthologs(self, organism, gene_id):
    """Retrieve a list of orthologs for the given gene ID.
    orthologs = []
    with self._get_open_handle("Gene", "Compara_Ortholog",
            organism, gene_id) as in_handle:
        soup = BeautifulSoup(in_handle)
        orth_table = soup.find("table", "orthologues")
        orth_links = orth_table.findAll("a", 
                href = re.compile("Gene/Summary"))
        for orth_link in orth_links:
            href_parts = [x for x in orth_link['href'].split('/') if x]
            orthologs.append((href_parts[0], orth_link.string))
    return orthologs

The full example script takes an organism and gene ID as input and displays homologs in Ensembl along with distance from the initial organism, protein domains, and features of the protein. The script uses Biopython, the Newick Tree Parser to parse phylogenetic trees, and NetworkX to calculate phylogenetic distances from the tree.
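The idea behind the distance column is to sum branch lengths along the path joining two leaves of the gene tree. This pure-Python sketch uses a tiny made-up tree (both topology and branch lengths are invented); the real script parses the Ensembl gene tree and computes paths with NetworkX:

```python
# Each child maps to its (parent, branch_length); values are made up.
tree = {"Homo_sapiens": ("anc1", 0.004),
        "Pan_troglodytes": ("anc1", 0.005),
        "anc1": ("root", 0.010),
        "Macaca_mulatta": ("root", 0.040)}

def dists_to_ancestors(leaf):
    """Map each ancestor of leaf to the summed branch length up to it."""
    dists, total, node = {}, 0.0, leaf
    while node in tree:
        parent, length = tree[node]
        total += length
        dists[parent] = total
        node = parent
    return dists

def tree_distance(a, b):
    da, db = dists_to_ancestors(a), dists_to_ancestors(b)
    # The path between two leaves meets at their lowest common ancestor,
    # which is the shared ancestor with the minimum combined distance.
    return min(da[n] + db[n] for n in da if n in db)

print(round(tree_distance("Homo_sapiens", "Pan_troglodytes"), 3))  # 0.009
```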

> python2.5 Homo_sapiens ENSG00000173894
Homo_sapiens 0 [u'IPR016197', u'IPR000953'] 38.0
Pan_troglodytes 0.009 [u'IPR016197', u'IPR000637', u'IPR000953'] 38.0
Gorilla_gorilla 0.0169 [u'IPR016197', u'IPR000637', u'IPR000953'] 
Macaca_mulatta 0.0538 [u'IPR000637', u'IPR000953'] 36.0
Tarsius_syrichta 0.1622 [u'IPR000637'] 
Microcebus_murinus 0.1848 [u'IPR000637', u'IPR000953'] 33.5

This demonstrates using the framework to look at the change in domains and protein charge across a gene family. The general idea of the EnsemblComparaRest class could be applied to other information of interest available from Ensembl web pages.

Written by Brad Chapman

December 28, 2008 at 10:14 pm