Contents
- Overview:
The overview section for "Mining the Bibliome" briefly describes
this entire project involving the extraction of biomedical information.
- Resources:
The resources section describes the data, lists works, and
publications that users may find useful.
- Documentation:
The documentation section contains all recorded information for this
project including the guided tour, the description of data, and
software/tools. It also includes specific documentation for annotators,
containing sections in annotator policies, contact information,
guidelines to annotation, as well as archived emails pertaining to
questions and discussions brought up, respectively, within each
annotation project.
- Software/Tools:
The Software/Tools section contains links to the software/tools,
describes what software is needed, and how to setup, run, and use each
respective software/tool required in this entire project.
- Archive
Releases:
The archive releases section contains release versions
of each stage of development for the entire project.
OverviewAbstract: Our goal is qualitatively
better methods for automatically extracting information from the
biomedical literature, relying on recent progress and new research in
three areas: high-accuracy parsing, shallow semantic analysis, and
integration of large volumes of diverse data. We are focusing initially on
two applications: drug development, in collaboration with researchers in
the Knowledge Integration and Discovery Systems group at GlaxoSmithKline,
and pediatric oncology, in collaboration with researchers in the eGenome
group at Children's Hospital of Pennsylvania. These applications,
worthwhile in their own right, provide excellent test beds for broader
research efforts in natural language processing and data integration.
In particular, we propose to develop and test new general methods for
information extraction from text, based on on-going research at Penn in
corpus-based algorithms for parsing, predicate-argument analysis and
reference resolution. We will collaborate with groups led by Robert
Gaizauskas at the University of Sheffield (UK) and by Jun-ichi Tsujii at
the University of Tokyo (Japan), as well as with the GSK and CHOP groups,
in applying these general methods to particular problems in biomedical
information extraction. The GSK group has already made effective use of
the best available information-extraction technology, so that the new
techniques can be assessed in the drug-development application for their
added value relative to the state of the art.
To give a simple, concrete example, we want a program that will read a
phrase like
Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated
activities with Ki values of 45.1--271.6 μM
and add to a database a set of entries whose ordinary-language
presentation is
amiodarone inhibits CYP2C9 with Ki=45.1--271.6 amiodarone
inhibits CYP2D6 with Ki=45.1--271.6 amiodarone inhibits CYP3A4 with
Ki=45.1--271.6
This project will also address several database research problems,
including methods for modeling complex, incomplete and changing
information using semistructured data, and also ways to connect the text
analysis process to an information integration environment that can deal
with the wide variety of extant bioinformatic data models, formats,
languages and interfaces. Such problems are central to current database
research at Penn. Our work will build on the progress represented by
Penn's K2/Kleisli data integration environment, by extending it to deal
with semi-structured data. This will be important managing the information
that is extracted, and also in making use of the many specialized data
resources that can be brought to bear in generating the high-accuracy text
analysis required to extract the information effectively in the first
place. Because the K2/Kleisli system is widely used in bioinformatics
applications, and in particular has been used for several years at GSK,
these extensions will facilitate collaboration with biomedical researchers
as well as supporting the IE research itself.
The engine of recent progress in language processing research has been
linguistic data: text corpora, treebanks, lexicons, test corpora for
information retrieval and information extraction, and so on. Much of this
data has been created by Penn researchers and published by Penn's
Linguistic Data Consortium. As part of the proposed project, we will
develop and publish new linguistic resources in three categories: a large
corpus of biomedical text annotated with syntactic structures
(Treebank) and shallow semantic structures ("proposition bank"
or Propbank); a large set of biomedical abstracts and full-text
articles annotated with entities and relations of interest to researchers,
such as enzyme inhibition, or mutation/cancer connections
(Factbanks); and broad-coverage lexicons and tools for the analysis
of biomedical texts.
A sister project entitled "Language, Learning and Modeling Biological
Sequences", funded via NSF ITR award EIA-0205456, is also underway at
IRCS. |