Email Contact: bioie@ldc.upenn.edu
This material is based upon work supported by the National Science Foundation under Grant No.: EIA-0205448
Mining the Bibliome
Overview for Mining the Bibliome
 
    

Contents

  1. Overview:
    The overview section for "Mining the Bibliome" briefly describes this entire project involving the extraction of biomedical information.

  2. Resources:
    The resources section describes the data and lists works and publications that users may find useful.

  3. Documentation:
    The documentation section contains all recorded information for this project, including the guided tour, the description of data, and software/tools. It also includes documentation for annotators, with sections on annotator policies, contact information, and annotation guidelines, as well as archived emails covering the questions and discussions raised within each annotation project.

  4. Software/Tools:
    The Software/Tools section links to each software tool required in this project, describes what software is needed, and explains how to set up, run, and use each one.

  5. Archive Releases:
    The archive releases section contains release versions of each stage of development for the entire project.




Overview

Abstract:   Our goal is to develop qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent progress and new research in three areas: high-accuracy parsing, shallow semantic analysis, and integration of large volumes of diverse data. We are focusing initially on two applications: drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline (GSK), and pediatric oncology, in collaboration with researchers in the eGenome group at Children's Hospital of Philadelphia (CHOP). These applications, worthwhile in their own right, provide excellent test beds for broader research efforts in natural language processing and data integration.

In particular, we propose to develop and test new general methods for information extraction from text, based on ongoing research at Penn in corpus-based algorithms for parsing, predicate-argument analysis and reference resolution. We will collaborate with groups led by Robert Gaizauskas at the University of Sheffield (UK) and by Jun-ichi Tsujii at the University of Tokyo (Japan), as well as with the GSK and CHOP groups, in applying these general methods to particular problems in biomedical information extraction. The GSK group has already made effective use of the best available information-extraction technology, so the new techniques can be assessed in the drug-development application for their added value relative to the state of the art.

To give a simple, concrete example, we want a program that will read a phrase like

Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities with Ki values of 45.1--271.6 μM

and add to a database a set of entries whose ordinary-language presentation is

amiodarone inhibits CYP2C9 with Ki=45.1--271.6
amiodarone inhibits CYP2D6 with Ki=45.1--271.6
amiodarone inhibits CYP3A4 with Ki=45.1--271.6
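
To make the target output concrete, here is a minimal sketch in Python that turns the example phrase above into the three entries. It is only an illustration of the input/output relationship: it uses a single hand-written regular expression tuned to this one sentence shape, whereas the project itself proposes corpus-based parsing and predicate-argument analysis. The names Inhibition, PATTERN, extract and render are hypothetical and do not come from the project's software.

```python
import re
from typing import List, NamedTuple


class Inhibition(NamedTuple):
    """One extracted fact (hypothetical record layout, for illustration only)."""
    agent: str
    target: str
    ki_range: str


# Assumption: the sentence has the shape
#   "<Agent> [adverb] inhibited <targets>-mediated activities with Ki values of <lo>--<hi>"
# Real biomedical text varies far more than this pattern allows.
PATTERN = re.compile(
    r"^(?P<agent>\w+)\s+(?:\w+\s+)?inhibited\s+"
    r"(?P<targets>.+?)-mediated activities\s+"
    r"with Ki values of\s+(?P<ki>[\d.]+--[\d.]+)"
)


def extract(sentence: str) -> List[Inhibition]:
    """Return one Inhibition per coordinated target, or [] if the pattern fails."""
    match = PATTERN.search(sentence)
    if match is None:
        return []
    # Split the coordinated list "A, B, and C" into its individual targets.
    targets = [
        t for t in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("targets")) if t
    ]
    agent = match.group("agent").lower()
    return [Inhibition(agent, target, match.group("ki")) for target in targets]


def render(fact: Inhibition) -> str:
    """Format a fact in the ordinary-language presentation used above."""
    return f"{fact.agent} inhibits {fact.target} with Ki={fact.ki_range}"


if __name__ == "__main__":
    phrase = (
        "Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated "
        "activities with Ki values of 45.1--271.6 μM"
    )
    for fact in extract(phrase):
        print(render(fact))
```

The sketch distributes the shared Ki range across each coordinated enzyme name, which is the step that makes the example harder than simple keyword matching. A pattern like this breaks as soon as the wording changes, which is the motivation for the parsing-based methods proposed here.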

This project will also address several database research problems, including methods for modeling complex, incomplete and changing information using semistructured data, and also ways to connect the text analysis process to an information integration environment that can deal with the wide variety of extant bioinformatic data models, formats, languages and interfaces. Such problems are central to current database research at Penn. Our work will build on the progress represented by Penn's K2/Kleisli data integration environment, by extending it to deal with semistructured data. This will be important in managing the information that is extracted, and also in making use of the many specialized data resources that can be brought to bear in generating the high-accuracy text analysis required to extract the information effectively in the first place. Because the K2/Kleisli system is widely used in bioinformatics applications, and in particular has been used for several years at GSK, these extensions will facilitate collaboration with biomedical researchers as well as support the IE research itself.

The engine of recent progress in language processing research has been linguistic data: text corpora, treebanks, lexicons, test corpora for information retrieval and information extraction, and so on. Much of this data has been created by Penn researchers and published by Penn's Linguistic Data Consortium. As part of the proposed project, we will develop and publish new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structures (Treebank) and shallow semantic structures ("proposition bank" or Propbank); a large set of biomedical abstracts and full-text articles annotated with entities and relations of interest to researchers, such as enzyme inhibition, or mutation/cancer connections (Factbanks); and broad-coverage lexicons and tools for the analysis of biomedical texts.

A sister project entitled "Language, Learning and Modeling Biological Sequences", funded via NSF ITR award EIA-0205456, is also underway at IRCS.