Penn BioIE, the biomedical information extraction project at
the University of Pennsylvania
The JULIE Lab at Jena University is offering several NLP tools, including
a named entity tagger (JNET) trained on PennBioIE oncology data, and a
sentence splitter (JSBD) and a tokenizer (JTBD) based in part on PennBioIE
data.
FABLE
(Fast Automated Biomedical Literature Extraction) is an online
application developed at The Children's Hospital of Philadelphia to
mine biomedical literature for information about human genes and
proteins, using data developed by the PennBioIE project. FABLE enables users to search MEDLINE® and PubMed®, more effectively than PubMed's own search engine, for
- articles mentioning specific genes or proteins, even if the article
uses a different name than the one specified in the search (Article Finder)
- sets of genes that are mentioned in articles containing one or more
keyword search terms of interest, singly or in arbitrary Boolean
combinations, and to view a list of all known synonyms for each gene (Gene
Lister)
New features in FABLE Version 3.0:
- Literature-enabled local mirror of the UCSC Genome Browser
- New "Did you mean" feature for Gene Lister: when a query returns no
results, alternate spellings are suggested
- Faster searches
PennBioIE, the biomedical information
extraction project at the University of Pennsylvania, announces the Web
availability of Release 0.9 of its data for
unrestricted free download.
The corpus consists of 2258 Medline abstracts in two
domains:
- the molecular genetics of oncology (1158 abstracts)
- the inhibition of enzymes of the CYP450 class (1100 abstracts)
All of the abstracts have been manually annotated for
paragraphs, sentences, part of speech, and a set of biomedical entity
types defined for this project and specific to each domain. Our entity definitions have been developed by domain experts from GSK (for CYP450) and Children's Hospital of Philadelphia (for oncology).
In addition, 642 of the abstracts have been
syntactically annotated (318 oncology, 324 CYP). The entity annotations
are consistent with the syntactic annotation to the fullest extent possible. The part of speech and syntactic annotation are in line
with Penn Treebank guidelines, with modifications and additions as
necessary.
We have developed a number of taggers from these manual
annotations, resulting in several publications.
Innovations
We believe that we have taken the integration of domain
knowledge, entity annotation, and syntactic annotation to a new level
with this project. We have also devised and implemented a way to
annotate discontinuous entities, which we call "chains", such as
"sulfuric acid" in "sulfuric or hydrochloric acid".
I. "Data" section of the website:
II. Tools
Outline of Entity Annotation File Format
All our annotation is standoff, so that an "annotated file" generally
comprises two files:
- the unmodified source text
- an annotation file referring to the source text by character offsets
(span="0..10" for the first token of a file beginning
"Antimicrob Agents Chemother.")
Our tool for paragraph, sentence, entity, and POS annotation --
everything except Treebanking -- is WordFreak; the code,
documentation, and output format are available in open source
through SourceForge. Its annotation files have the suffix ".ann".
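As a rough illustration of how a standoff span maps back to the source text, the sketch below resolves the span="0..10" example given above. It assumes that the end offset is exclusive, which is consistent with "Antimicrob" being ten characters long; the function names parse_span and span_text are hypothetical and are not part of WordFreak.

```python
def parse_span(span: str) -> tuple[int, int]:
    """Parse a span string such as "0..10" into (start, end) offsets."""
    start_text, end_text = span.split("..")
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid span: {span!r}")
    return start, end


def span_text(source: str, span: str) -> str:
    """Return the part of the source text that a span refers to.

    Assumption: the end offset is exclusive, as the "0..10" example suggests.
    """
    start, end = parse_span(span)
    return source[start:end]


source = "Antimicrob Agents Chemother."
first_token = span_text(source, "0..10")
print(first_token)  # Antimicrob
```

Because the source text is never modified, any number of annotation files can refer to the same offsets.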
The following sections of our annotators' documentation describe the
format from the annotator's point of view:
- Pre-tagging, as we call it: paragraphs, sentences (including
"sections" -- non-biomedical text), and tokens. Much of the contents of
this document is repeated in the other documents in this list, but not
all of it, as they are aimed at different groups of annotators.
- Entity annotation general principles
- Tag-within-tag: Entity tags embedded within other entity tags
- POS annotation basics
- Discontinuous entity references ("chains")
Briefly, the .ann file is an XML file most of whose
contents consist of the following nested components:
- file
- paragraph
- sentence or section
- optionally, entity,
possibly comprising the span of other entities as substrings
The annotations have sequential ID numbers, starting
with 1, in the order of their beginning tags. Every annotation is
labeled with a type: file, paragraph, sentence, section, or the name of
one of our entities or parts of speech.
Every non-whitespace character must be contained in exactly one of
each of these elements, except for the entity level, about which more
below. There is only one file component per .ann file. Sentence and
section are structurally equivalent, but we use the section tag for
non-biomedical text such as authors' names and affiliations.
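The coverage rule above can be checked mechanically. The following is a minimal sketch, assuming the spans of one level (for example all sentences and sections) have already been read into (start, end) pairs with exclusive end offsets; coverage_errors is a hypothetical helper, not a tool from this project.

```python
from collections import Counter


def coverage_errors(source: str, spans: list[tuple[int, int]]) -> list[int]:
    """Return offsets of non-whitespace characters not covered exactly once."""
    counts = Counter()
    for start, end in spans:
        for offset in range(start, end):
            counts[offset] += 1
    return [
        offset
        for offset, char in enumerate(source)
        if not char.isspace() and counts[offset] != 1
    ]


source = "Antimicrob Agents Chemother."
complete = [(0, 28)]                       # one span over the whole text
incomplete = [(0, 10), (11, 17), (18, 27)]  # final "." deliberately left out

print(coverage_errors(source, complete))    # []
print(coverage_errors(source, incomplete))  # [27]
```

The same check would be run once per level (file, paragraph, sentence or section), but not on the entity level, where coverage is optional.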
Within the sections we don't check tokenization or POS tags, and we
don't tag entities, other than making sure that every non-whitespace
character is nested correctly. You may find a "token" tag in lieu of a
POS tag, most likely when an entity annotator has had to change the
tokenization of a string of text and the POS annotators have missed
that one in cleaning up.
Entities apply to strings of tokens, never parts of tokens, which
is why an entity annotator may have to change the results of the
automatic tokenization. We generally avoid nested entities, but in
certain cases we allow or require them. Three-deep nesting --
[entity [entity [entity]]]
-- is theoretically allowed, but there are probably no cases of it, at
least in this release. Nested entities are not necessarily
coextensive; the inner one is usually a proper substring of the outer.
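To see how deep the nesting in a given file actually goes, something like the sketch below could be used. It assumes entity spans are given as (start, end) pairs in ID order and that no two spans cross; when two spans are coextensive, the one with the lower ID is treated as the outer one, since IDs follow the order of the beginning tags. The names encloses and nesting_depth are hypothetical.

```python
Span = tuple[int, int]


def encloses(outer: Span, inner: Span) -> bool:
    """True if inner lies entirely within outer (coextensive spans included)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def nesting_depth(entities: list[Span]) -> int:
    """Return the deepest nesting level among entity spans listed in ID order.

    Assumption: spans never cross, so every span enclosing a given span is
    itself nested inside or around the others that enclose it.
    """
    deepest = 0
    for index, inner in enumerate(entities):
        level = 1
        for other_index, outer in enumerate(entities):
            if other_index == index or not encloses(outer, inner):
                continue
            # Coextensive spans: only the earlier ID counts as the outer one.
            if outer == inner and other_index > index:
                continue
            level += 1
        deepest = max(deepest, level)
    return deepest


print(nesting_depth([(12, 29), (25, 29)]))  # 2
```

A result of 3 would correspond to the [entity [entity [entity]]] case described above.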
At the end of the .ann file there are optionally two more types of
component, chains and relations (only chains in the release 0.9
data). A chain is our way of annotating a discontinuous string. We use
chaining only for entity references, not for any other level, and we
restrict it quite tightly (ibid.). A chain consists of two or more
entity annotations of the same type, identified by their ID numbers.
Their order in the chain is not supposed to be distinctive, but we
have found that which one is tagged first may have some effect on
subsequent annotations, so it is possible that some of the chains in
the XML files list the ID numbers in non-sequential order.
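The chain rules can be illustrated with the "sulfuric or hydrochloric acid" example. The sketch below is not the WordFreak data model: the Entity class, the type name "substance", the ID numbers, and both helper functions are hypothetical, chosen only to show that a chain joins same-type entity annotations by ID and that the order of the IDs does not matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: int
    type: str   # hypothetical type name, for illustration only
    start: int
    end: int    # assumed exclusive


def validate_chain(chain_ids: list[int], entities: dict[int, Entity]) -> list[str]:
    """Return a list of problems; an empty list means the chain looks valid."""
    problems = []
    if len(set(chain_ids)) < 2:
        problems.append("a chain needs two or more entity annotations")
    if len({entities[i].type for i in chain_ids}) > 1:
        problems.append("all members of a chain must have the same type")
    return problems


def chain_text(source: str, chain_ids: list[int], entities: dict[int, Entity]) -> str:
    """Join the chained pieces in text order, whatever order the IDs are in."""
    members = sorted((entities[i] for i in chain_ids), key=lambda e: e.start)
    return " ".join(source[e.start:e.end] for e in members)


source = "sulfuric or hydrochloric acid"
entities_by_id = {
    1: Entity(1, "substance", 0, 8),    # "sulfuric"
    2: Entity(2, "substance", 12, 29),  # "hydrochloric acid"
    3: Entity(3, "substance", 25, 29),  # "acid"
}
chain = [3, 1]  # IDs listed in non-sequential order on purpose

print(validate_chain(chain, entities_by_id))      # []
print(chain_text(source, chain, entities_by_id))  # sulfuric acid
```

Sorting by start offset rather than by position in the chain keeps the result stable even when a file lists the ID numbers out of order.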
2005-07-15
==================================================================================
-- Mark A. Mandel, Research Administrator, PennBioIE
Linguistic Data Consortium, University of Pennsylvania