PennBioIE, the biomedical information extraction project at
the University of Pennsylvania
PennBioIE Release 1.0
PennBioIE, the biomedical information extraction project at the University of
Pennsylvania, announces Release 1.0 of its CYP and Oncology corpora through the Linguistic Data Consortium. The files have been tokenized
and their biomedical portions annotated for paragraph, sentence, and part of speech,
and for biomedical named entity types relevant to the topic. A subset
of the files in each corpus has also been syntactically annotated.
Annotation at all layers except entity is based on the Penn Treebank II
guidelines with a number of modifications, many of them subsequently adopted by the
Penn Treebank. All annotations are standoff. Paragraph, sentence, tokenization, POS,
and syntactic annotation (treebanking) were applied by automatic taggers and manually
corrected; entity annotation was manual. The annotation procedure and format are briefly
outlined below.
Annotations in biomedical text: abstract title and text, but not authors' names,
affiliations, etc.
| corpus | sentences | tagged tokens | entities |
| CYP | 11,933 | 274,167 | 52,710 |
| Oncology | 14,668 | 330,680 | 61,058 |
| total | 26,601 | 604,847 | 113,768 |
The PennBioIE CYP corpus (ISBN 1-58563-498-0) consists of 1100
PubMed abstracts on the inhibition of cytochrome P450 enzymes, comprising
approximately 313,000 total words of text. Each file has been tokenized and its
biomedical portions (274,000 words) exhaustively annotated for paragraph,
sentence, and part of speech, and non-exhaustively annotated for 5 types of
biomedical named entity in three categories of interest, in collaboration with GlaxoSmithKline R&D. 324
of the abstracts have also been syntactically annotated.
The PennBioIE Oncology corpus (ISBN 1-58563-490-5) consists of
1414 PubMed abstracts on cancer concentrating on molecular genetics, comprising
approximately 381,000 total words of text. Each file has been tokenized and its
biomedical portions (327,000 words) exhaustively annotated for paragraph,
sentence, and part of speech, and non-exhaustively annotated for 16 ("Level 1") or 23
("Level 2") types of named entity, in collaboration with Children's Hospital of Philadelphia. 318
of the abstracts have also been syntactically annotated.
There are two oncology subcorpora:
- The Sanger subcorpus consists of abstracts of 577 articles selected at the Sanger
Institute for mention of oncological named entities, concentrating on variations
in a small set of human genes associated with many different types of cancer.
- The neuroblastoma subcorpus consists of 837 abstracts of articles dealing
with this particular type of cancer, selected by colleagues at Children's Hospital of Philadelphia. They
do not all concentrate on genetics, but they mention a much larger number of genes
than the Sanger files do.
Related Projects
The JULIE Lab at Jena University offers several NLP tools, including a named
entity tagger (JNET) trained on PennBioIE oncology data, and a sentence splitter
(JSBD) and a tokenizer (JTBD) based in part on PennBioIE data.
FABLE
(Fast Automated Biomedical Literature Extraction) is an online
application developed at The Children's Hospital of Philadelphia to
mine biomedical literature for information about human genes and
proteins, using data developed by the PennBioIE project. FABLE enables users to search MEDLINE® and PubMed®, more effectively than PubMed's own search engine, for
- articles mentioning specific genes or proteins, even if the article
uses a different name than the one specified in the search (Article Finder)
- sets of genes that are mentioned in articles containing one or more
keyword search terms of interest, singly or in arbitrary Boolean
combinations, and to view a list of all known synonyms for each gene (Gene
Lister)
New features in FABLE Version 3.0:
- Literature-enabled local mirror of the UCSC Genome Browser
- New "Did you mean" feature for Gene Lister: when a query returns no
results, alternate spellings are suggested
- Faster searches
PennBioIE Release 0.9
The earlier, interim version of our corpus,
Release 0.9, remains available for
unrestricted free download.
We have developed a number of taggers from these manual
annotations, resulting in several publications.
Innovations
We believe that we have taken the integration of domain
knowledge, entity annotation, and syntactic annotation to a new level
with this project. We have also devised and implemented a way to
annotate discontinuous entities, which we call "chains", such as
"sulfuric acid" in "sulfuric or hydrochloric acid".
I. "Data" section of the website:
1. entity-annotated data (including paragraph,
sentence, and POS):
A. HTML View:
View and read source texts with a Web browser. The annotations are
displayed on screen in various levels of detail that you can control as
you view them. Also provides a link to the PubMed article as posted by
the NIH.
B. WordFreak Files:
Download any or all of the files in the format used by our entity
annotation tool, WordFreak: the source text in a plain text file and
the annotations in an XML file.
C. Extracted ML Data:
Extracted annotation information in XML format that may be useful for
machine learning purposes.
2. treebanked data
Download any or all of the files in the .mrg file format available from
the Penn Treebank, with each tree shown in bracketed form, along with
some additional documented information.
II. Tools
Outline of Entity Annotation File Format
All our annotation is standoff, so that an "annotated file" generally comprises two files:
- the unmodified source text
- an annotation file referring to the source text by character offsets
(e.g., span="0..10" for the first token of a file beginning "Antimicrob Agents Chemother.")
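As a minimal sketch of how such an offset reference can be resolved, the Python below parses a span string and slices the source text with it. The function names are illustrative only and are not part of WordFreak. The end offset is assumed to be exclusive, which is what the example above implies ("Antimicrob" is 10 characters long and has the span 0..10).

```python
def parse_span(span: str) -> tuple[int, int]:
    """Split a span string such as "0..10" into integer start and end offsets."""
    start, end = span.split("..")
    return int(start), int(end)


def text_for_span(source: str, span: str) -> str:
    """Return the source substring that a standoff span points at.

    Assumption: the end offset is exclusive, as the "Antimicrob"
    example suggests (10 characters, span 0..10).
    """
    start, end = parse_span(span)
    return source[start:end]


source_text = "Antimicrob Agents Chemother."
print(text_for_span(source_text, "0..10"))  # Antimicrob
```

Because the source text is never modified, the same offsets stay valid for every annotation layer that refers to it.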
Our tool for paragraph, sentence, entity, and POS annotation (everything except
treebanking) is WordFreak; the code, documentation, and output format are available
in open source through SourceForge. Its annotation files have the suffix ".ann".
The following sections of our annotators' documentation describe the
format from the annotator's point of view:
- Pre-tagging, as we call it: paragraphs, sentences (including "sections", that is,
non-biomedical text), and tokens. Much of the content of this
document is repeated in the other documents in this list, but not all
of it, as they are aimed at different groups of annotators.
- Entity annotation general principles
- Tag-within-tag: Entity tags embedded within other entity tags
- POS annotation basics
- Discontinuous entity references ("chains")
Briefly, the .ann file is an XML file most of whose
contents consist of the following nested components:
- file
- paragraph
- sentence or section
- optionally, entity, possibly comprising the span of other entities as substrings
The annotations have sequential ID numbers, starting
with 1, in the order of their beginning tags. Every annotation is
labeled with a type: file, paragraph, sentence, section, or the name of
one of our entities or parts of speech.
Every non-whitespace character must be contained in exactly one of each of
these elements, except for the entity level, about which more below. There is only
one file component per .ann file. Sentence and section are structurally equivalent,
but we use the section tag for non-biomedical text such as authors' names and
affiliations.
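The coverage rule can be checked mechanically. The sketch below is not our validation code: it assumes the annotations have already been read from the .ann file into hypothetical `(type, start, end)` tuples with exclusive end offsets, and it reports every non-whitespace character that is not covered exactly once at the file, paragraph, and sentence-or-section levels.

```python
from collections import Counter

# Hypothetical in-memory form of an annotation: (type, start, end), end exclusive.
Annotation = tuple[str, int, int]

# "sentence" and "section" are structurally equivalent, so they share one level.
LEVELS = {
    "file": {"file"},
    "paragraph": {"paragraph"},
    "sentence/section": {"sentence", "section"},
}


def coverage_errors(source: str, annotations: list[Annotation]) -> list[str]:
    """List every non-whitespace character not covered exactly once per level."""
    errors = []
    for level, types in LEVELS.items():
        counts = Counter()
        for ann_type, start, end in annotations:
            if ann_type in types:
                counts.update(range(start, end))
        for offset, char in enumerate(source):
            if not char.isspace() and counts[offset] != 1:
                errors.append(
                    f"{level}: offset {offset} ({char!r}) covered {counts[offset]} times"
                )
    return errors


source_text = "Title here. Smith J."
annotations = [
    ("file", 0, 20),
    ("paragraph", 0, 20),
    ("sentence", 0, 11),   # biomedical text
    ("section", 12, 20),   # authors' names
]
print(coverage_errors(source_text, annotations))  # []
```

Entities are deliberately left out of `LEVELS`, since the entity level is exempt from the rule.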
Within the sections, we don't check tokenization or POS tags, and we don't tag
entities, other than making sure that every non-whitespace character is nested
correctly. You may find a "token" tag in lieu of a POS tag, most likely when an
entity annotator has had to change the tokenization of a string of text and the POS
annotators have missed that one in cleaning up.
Entities apply to strings of tokens, never parts of tokens, which is why an
entity annotator may have to change the results of the automatic tokenization. We
generally avoid nested entities, but in certain cases we allow or require them.
Three-deep nesting -- [entity [entity [entity]]] -- is theoretically allowed, but
there are probably no cases of it, at
least in this release. Nested entities are not necessarily
coextensive; the inner one is usually a proper substring of the outer.
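To see how deeply the entities in a file actually nest, one could compute the maximum nesting depth from their spans. The following is only a sketch under stated assumptions: entity spans are given as hypothetical `(start, end)` pairs with exclusive ends, and they nest properly (no partial overlaps). Coextensive spans are counted as nested.

```python
def max_nesting_depth(spans: list[tuple[int, int]]) -> int:
    """Return the deepest level of entity nesting among (start, end) spans.

    Assumes spans nest properly (no partial overlaps) and ends are exclusive.
    """
    depth = 0
    stack: list[int] = []  # end offsets of the enclosing spans still open
    # Sorting by start, then longest first, visits outer spans before inner ones.
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        while stack and stack[-1] <= start:
            stack.pop()  # that span ended before this one begins
        stack.append(end)
        depth = max(depth, len(stack))
    return depth


# One outer entity with two inner entities side by side: depth 2.
print(max_nesting_depth([(0, 20), (0, 8), (12, 20)]))  # 2
```

A result of 3 would indicate one of the three-deep cases described above.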
At the end of the .ann file there may be two more types of component,
chains and relations (only chains in the release 0.9 data). A chain
is our way of annotating a discontinuous string. We use chaining only for entity
references, not for any other level, and we restrict it quite tightly (see the
"Discontinuous entity references" documentation listed above). A
chain consists of two or more entity annotations of the same type, identified by
their ID numbers. Their order in the chain is not supposed to be distinctive, but we
have found that which one is tagged first may have some effect on subsequent
annotations, so it is possible that some of the chains in the XML files list the ID
numbers in non-sequential order.
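A consumer of the data can therefore not rely on the order of IDs within a chain. The sketch below shows one way to reassemble the text of a chain by sorting its members by start offset. It assumes the entity annotations have been loaded into a hypothetical dictionary from ID number to `(type, start, end)`; the IDs and the entity type name "substance" are invented for illustration and are not taken from our tag set.

```python
# Hypothetical in-memory form: annotation ID -> (entity type, start, end), end exclusive.
Entity = tuple[str, int, int]


def chain_text(source: str, entities: dict[int, Entity], chain_ids: list[int]) -> str:
    """Join the pieces of a chain into one string, in source-text order."""
    members = [entities[i] for i in chain_ids]
    if len(members) < 2:
        raise ValueError("a chain needs two or more entity annotations")
    if len({ann_type for ann_type, _, _ in members}) != 1:
        raise ValueError("all members of a chain must share one entity type")
    # IDs may be listed out of sequence, so order the pieces by start offset.
    ordered = sorted(members, key=lambda m: m[1])
    return " ".join(source[start:end] for _, start, end in ordered)


source_text = "sulfuric or hydrochloric acid"
entities = {
    7: ("substance", 0, 8),    # "sulfuric" (illustrative ID and type)
    8: ("substance", 25, 29),  # "acid"
}
print(chain_text(source_text, entities, [8, 7]))  # sulfuric acid
```

Joining the pieces with a single space is a simplification; the original spacing between the pieces is not preserved.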
2009-02-09
Mark A. Mandel, Research Administrator, PennBioIE
Linguistic Data Consortium, University of Pennsylvania