Extracting keywords from biological text using Zemanta

Increasingly, my daily work is shifting from a model of “let me do this analysis for you and give you back some data” to “let me provide an interface that allows you to do the analyses yourself.” This is great, as it allows closer collaboration with lab scientists and also helps split up work so I can be involved in more projects. One common interface suggestion is a Google-like keyword search box: enter some text and find anything related to that term. In implementing this, I wanted to provide reasonable search suggestions by identifying keywords from gene descriptions. These can help frame researchers’ questions and provide clues about useful search terms for new users.

Here is an implementation of keyword extraction from biological text using the Zemanta semantic API. The function uses the Zemanta REST interface and parses the JSON output with simplejson. The parsed JSON is available as a Python dictionary.

import urllib
import urllib2

import simplejson

def query_zemanta(search_text):
    """Send text to Zemanta and print suggested keywords and links (Python 2)."""
    gateway = 'http://api.zemanta.com/services/rest/0.0/'
    args = {'method': 'zemanta.suggest',
            'api_key': 'YOUR_API_KEY',
            'text': search_text,
            'return_categories': 'dmoz',
            'return_images': 0,
            'return_rdf_links': 1,
            'format': 'json'}
    args_enc = urllib.urlencode(args)

    # Passing the encoded arguments as data makes this a POST request.
    raw_output = urllib2.urlopen(gateway, args_enc).read()
    output = simplejson.loads(raw_output)

    print 'First article:', output['articles'][0]
    print 'Keywords:', [k['name'] for k in output['keywords']]
    # Each link pairs an anchor phrase from the text with candidate targets.
    for link in output['markup']['links']:
        print link['anchor']
        for target in link['target']:
            if target['type'] in ['wikipedia', 'rdf']:
                print '\t', target['title'], target['url']
    print 'Marked up text', output['markup']['text']
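
The function above prints what it finds. For use behind a search interface, it may be handier to collect the same fields into plain Python structures. The following is a minimal sketch and not part of the original function: the name summarize_response and the sample dictionary are made up for illustration, and the dictionary only mimics the response layout accessed above (output['keywords'], output['markup']['links']). It is written to run on current Python versions.

```python
def summarize_response(output, link_types=("wikipedia", "rdf")):
    """Collect keywords and link targets from a parsed Zemanta-style response.

    Assumes the layout used in query_zemanta:
      output['keywords']        -> list of {'name': ...}
      output['markup']['links'] -> list of {'anchor': ..., 'target': [...]}
    """
    keywords = [k["name"] for k in output.get("keywords", [])]
    links = {}
    for link in output.get("markup", {}).get("links", []):
        # Keep only the target types of interest (Wikipedia and RDF by default).
        targets = [(t["title"], t["url"])
                   for t in link.get("target", [])
                   if t.get("type") in link_types]
        if targets:
            links.setdefault(link["anchor"], []).extend(targets)
    return keywords, links


# Hand-written sample for illustration only; this is not real Zemanta output.
sample = {
    "keywords": [{"name": "Drosophila"}, {"name": "Oocyte"}],
    "markup": {
        "links": [
            {"anchor": "JNK",
             "target": [
                 {"type": "wikipedia",
                  "title": "C-Jun N-terminal kinases",
                  "url": "http://en.wikipedia.org/wiki/C-Jun_N-terminal_kinases"},
                 {"type": "homepage",
                  "title": "Unrelated",
                  "url": "http://example.org/"},
             ]},
        ],
    },
}

keywords, links = summarize_response(sample)
```

Returning data instead of printing keeps the network call and the presentation separate, so the same summary could feed a web page or a test.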

In addition to extracting keywords, Zemanta also provides links to online resources. Here is the keyword list:

Keywords: [u'Drosophila', u'Biology', u'Messenger RNA', u'Germ cell',
           u'Mitogen-activated protein kinase', u'Oocyte', 
           u'C-Jun N-terminal kinases', u'RNA']
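
One way keyword lists like this could feed the search suggestions mentioned at the start is a small prefix lookup from keywords back to genes. This is only a sketch under stated assumptions: build_suggestion_index and suggest are hypothetical helper names, the glh-2 keywords are the ones listed above, and the second gene entry is invented purely to show a shared keyword.

```python
from collections import defaultdict


def build_suggestion_index(gene_keywords):
    """Map each lowercased keyword to the set of genes it was extracted from."""
    index = defaultdict(set)
    for gene, keywords in gene_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(gene)
    return index


def suggest(index, typed, limit=5):
    """Return up to `limit` keywords starting with the typed text, alphabetically."""
    prefix = typed.strip().lower()
    if not prefix:
        return []
    return sorted(k for k in index if k.startswith(prefix))[:limit]


gene_keywords = {
    "glh-2": ["Drosophila", "Biology", "Messenger RNA", "Germ cell",
              "Mitogen-activated protein kinase", "Oocyte",
              "C-Jun N-terminal kinases", "RNA"],
    # Hypothetical second entry, included only to show shared keywords.
    "example-gene": ["RNA", "Oocyte"],
}
index = build_suggestion_index(gene_keywords)
```

Typing "m" into the search box would then offer "messenger rna" and "mitogen-activated protein kinase", and each suggestion can be followed back to its genes through the index.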

And here is the original testing text marked up with the automated links:

glh-2 encodes a putative DEAD-box RNA helicase that contains six CCHC zinc fingers and is homologous to Drosophila VASA, a germ-line-specific, ATP-dependent, RNA helicase; GLH-2 activity may also be required for the wild-type morphology of P granules and for localization of several protein components, but not accumulation of P granule mRNA components; GLH-2 interacts in vitro with itself and with KGB-1, a JNK-like MAP kinase; GLH-2 is a constitutive P granule component and thus, with the exception of mature sperm, is expressed in germ cells at all stages of development; GLH-2 is cytoplasmic in oocytes and the early embryo, while perinuclear in all later developmental stages as well as in the distal and medial regions of the hermaphrodite gonad; GLH-2 is expressed at barely detectable levels in males.

In addition to the keywords, Zemanta does an excellent job of automatically annotating the text with links to relevant resources:

Most impressively, the JNK acronym is determined to reference C-Jun N-terminal kinases, providing a link to the Wikipedia reference.
Zemanta also provides links to Freebase, an open database in RDF-ready format. One example is this useful link to Drosophila from which you could automatically extract NCBI Taxon IDs.
The one automated semantic mistake is the link to KGB; it provides a link to the KGB headquarters in Russia.
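
Until mistakes like the KGB link are fixed upstream, one simple workaround is a hand-curated skip list applied to the links before display. The sketch below is an assumption on my part rather than anything Zemanta provides: drop_flagged_links is a hypothetical helper, and the anchors and targets in the sample are illustrative, not actual API output.

```python
def drop_flagged_links(links, flagged_anchors):
    """Return the links whose anchor text is not in the curated skip list."""
    flagged = {a.lower() for a in flagged_anchors}
    return [link for link in links if link["anchor"].lower() not in flagged]


# Illustrative links; the anchors and targets are assumptions, not API output.
links = [
    {"anchor": "JNK",
     "target": [{"type": "wikipedia", "title": "C-Jun N-terminal kinases"}]},
    {"anchor": "KGB",
     "target": [{"type": "wikipedia", "title": "KGB"}]},
]
kept = drop_flagged_links(links, ["KGB"])
```

The original list is left untouched, so flagged links can still be reviewed later.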

Zemanta also supports links to NCBI in their latest release (as a funny semantic miscue, the blog link to NCBI is to the wrong NCBI). I did not get any NCBI links with this small example, but it is exciting to see they are thinking about scientific applications.

In addition to Zemanta, I also tested OpenCalais with the Python interface, which was not as successful. For the same text above, it returned only a single keyword: the incorrect KGB one. Their current focus appears to be on finance, but it is worth watching for future developments.

Written by Brad Chapman

January 4, 2009 at 1:15 am

Posted in semanticweb

Tagged with bioinformatics, python, semanticweb

7 Responses


Hi Brad,

I am Andraz Tori, CTO at Zemanta.

I am very glad that you tested it out and liked it! We hope to improve it even further (and fix the KGB stuff too :)

Are you planning to use it in any way/service? If you need any additional information or help, please do let me know! Also if you have any ideas about Zemanta, please do let me know!

bye
Andraz Tori, Zemanta

Andraz Tori

January 5, 2009 at 11:05 am

- Andraz;
  Thanks for the note. I can’t emphasize enough how impressed I was with the Wikipedia extraction of the cryptic domain-specific acronyms.
  
  At Mass General Hospital, we are developing a local interface to data available from the model organism C. elegans. The initial plans for Zemanta usage are rather modest, likely adding the keyword searching and markup described in the post.
  
  Longer term, I think there is the possibility for data mining that combines the types of descriptions I used in the example with the semantic details Zemanta provides. We may be able to automatically provide links between genes of interest that are not readily inferred from the descriptions themselves. I’m sure that will be another longer post in the future.
  
  I’m looking forward to following the developments at Zemanta and am excited you are taking an interest in biological data. Thanks again,
  Brad
  
  Brad Chapman
  
  January 6, 2009 at 2:42 am
  
I am very glad to hear that!

If you have any questions, do not hesitate to mail me directly and whenever you have anything to show, let me know. We are also interested in promoting things that other people build with the help of Zemanta API.

And don’t forget to join the developer’s mailing list.

bye
Andraz Tori

Andraz Tori

January 14, 2009 at 11:03 am

Brilliant stuff. This is bringing me some ideas for integration into a project I’m currently working on. Great work.

Ricardo

January 19, 2009 at 11:36 am

- Ricardo — very happy to hear it was useful. I also expanded on this quite a bit in a post I just finished: Finding proteins with related function using semantic clustering.
  
  Brad
  
  Brad Chapman
  
  January 19, 2009 at 12:43 pm
  
fantastic post, I must accept you are writing out of box examples, I am really impressed with this blog, Hope I can learn a lot

Abhishek Tiwari

January 30, 2009 at 11:27 am

[…] interesting usage was presented by Brad Chapman at Blue Collar Bioinformatics blog. He’s extracting keywords from biological text using Zemanta API. He’s sharing his example python code on how to do it and explains in more details […]

Zemanta API: Bukisa and Biological text extraction | Zemanta Ltd.

February 2, 2009 at 10:11 am


Blue Collar Bioinformatics