http://bioIE.ldc.upenn.edu
Funded by NSF EIA-0205448, Information Technology Research (ITR) program
This CD and the associated Web-accessible materials are Release 0.9 of the Biomedical Information Extraction Project at the University of Pennsylvania. Our data is on this disk; our software and documentation are on the Web; all are described on this disk.
The goal of this project is, briefly, to develop qualitatively better methods for information extraction, specifically from biomedical literature. To that end we are annotating texts in two domains of biomedical knowledge:
To date all of our texts are abstracts taken from PubMed, but we are planning to advance to the annotation of full-text articles. All of the texts are annotated for paragraph, sentence, token, part of speech, and biomedical entity; 642 of them (324 cyp, 318 onco) are also syntactically annotated (treebanked).
Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is considered unchangeable. As a result, annotation practices have sometimes needed to make compromises which in theory would not have been necessary if the earlier annotation had been able to integrate the requirements of the later work.
Such integration is necessary here because of the ambitious scope of this project, which involves highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation levels work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. It is possible because of the long term of our grant (five years), and because we are starting with fresh text, applying all levels of annotation ourselves.
This disk contains 2257 annotated texts (1157 onco, 1100 cyp), in several formats:
HTML View: All annotations except for treebanking, in a format based on the WordFreak interface that can be displayed (but not modified) with an HTML browser. The display also provides a link to the abstract on PubMed. Microsoft Internet Explorer will display the data and annotations, but some of the functionality will be missing because Internet Explorer does not comply with all relevant standards. (Description here.) For Windows and UNIX-family systems we recommend the open-source Mozilla Firefox browser, which can be downloaded here; for Mac OS X, either Firefox or Safari. These files are uncompressed.
WordFreak: Annotation files for paragraph, sentence, token, part of speech, and entity annotation. These annotations are stored in a separate file from the source text and refer to the source file by character offsets. They can be viewed and manipulated with WordFreak, which reads in a source file (e.g., source_file_1234_56789.src) and its corresponding annotation file (source_file_1234_56789.src.ann) and records any added or changed annotations in the annotation file, leaving the source file untouched. These files are compressed with WinZip.
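As a rough illustration of the naming convention just described, the following Python sketch pairs each source file with its annotation file once the archive has been unpacked. The function name find_annotation_pairs and the assumption that both files sit in the same directory are ours, not part of the release.

```python
from pathlib import Path


def find_annotation_pairs(directory):
    """Return (source, annotation) path pairs found in `directory`.

    Assumes the convention described above: the annotation file has
    the same name as the source file plus a trailing ".ann".
    """
    pairs = []
    for src in sorted(Path(directory).glob("*.src")):
        ann = src.with_name(src.name + ".ann")  # foo.src -> foo.src.ann
        if ann.exists():
            pairs.append((src, ann))
    return pairs
```

Source files that have no matching .ann file are simply skipped; a stricter version might report them instead.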
Treebank: Treebank annotation files stored as plain text in the Penn Treebank format, plus the following types of additional information.
Machine learning: Annotation information extracted from the WordFreak files presented in an XML format that may be useful for machine learning applications. Unlike the other formats, which present each source text and its annotations as a single file or pair of files, in this format all the data for each domain is gathered into a single WinZip file.
When a text enters the LAW System, the system assigns it a unique source ID number. To distinguish the developing states of annotation as it progresses through the various tasks, it is also assigned a unique history ID in each task.
For example, we selected PubMed abstract #119555 for the CYP domain, downloaded it, saved it as pm00119555.cyp, and loaded it into the LAW System in the CYP Pretagging task. There the LAW System assigned it source ID 1769 and history ID 5364, so the source file was there called source_file_1769_5364.src and the annotation file, once created and checked in, was source_file_1769_5364.src.ann.
This pair of files was then passed along to CYP POS Tagging, pass 1 (this was before we changed the sequence of tasks), where it was assigned history ID 5694, so the pair consisted of (source_file_1769_5694.src, source_file_1769_5694.src.ann). The source file was unchanged, but since WordFreak uses file names to associate source and annotation files, the names must be identical except for the .ann extension.
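The file names in this example can be taken apart mechanically. The sketch below is only an illustration: the regular expression is inferred from the names quoted above, and parse_law_name is a hypothetical helper, not part of the LAW System or WordFreak.

```python
import re

# Pattern inferred from names such as source_file_1769_5364.src(.ann).
NAME_RE = re.compile(r"^source_file_(\d+)_(\d+)\.src(\.ann)?$")


def parse_law_name(filename):
    """Return (source_id, history_id, is_annotation) for a LAW-style name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a LAW-style file name: {filename!r}")
    source_id = int(match.group(1))
    history_id = int(match.group(2))
    is_annotation = match.group(3) is not None
    return source_id, history_id, is_annotation


print(parse_law_name("source_file_1769_5364.src"))
print(parse_law_name("source_file_1769_5694.src.ann"))
```

Files that share a source ID but differ in history ID are successive states of the same text.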
The LAW System keeps copies of all the states of annotation.
We are providing a bidirectional index for each domain. Each index file lists first all the source IDs in order with their corresponding PMIDs, and then all the PMIDs in order with their corresponding source IDs.
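Once the index has been read, lookups in both directions are straightforward. The sketch below deliberately does not assume anything about the on-disk layout of the index files; it starts from (source ID, PMID) pairs that have already been extracted, and build_index is a hypothetical helper name.

```python
def build_index(pairs):
    """Build two lookup tables from (source_id, pmid) pairs."""
    pmid_by_source = {}
    source_by_pmid = {}
    for source_id, pmid in pairs:
        pmid_by_source[source_id] = pmid
        source_by_pmid[pmid] = source_id
    return pmid_by_source, source_by_pmid


# The pair below is the CYP example discussed in this document.
pmid_by_source, source_by_pmid = build_index([(1769, 119555)])
print(pmid_by_source[1769])
print(source_by_pmid[119555])
```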
All our texts are annotated at the following levels:
Annotation at all levels except Entity is based on the guidelines and principles developed for the Penn Treebank, with many modifications that we have found necessary. Entity definitions come originally from our domain experts and are developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, and all annotations are made in a separate file.
Our definitions and documentation are maintained online for our annotators' use. The development of these definitions and guidelines is an ongoing effort, as our biomedical domain experts refine their definitions of the categories to be annotated, as we venture into different areas of the domain (for example, concentrating on neuroblastoma within the oncology domain), and as our annotators report back from the front lines on what happens when theory (definitions and guidelines) meets data (text).
span, character offset: WordFreak stores the span of each annotated string as two integers: the position of the first character in the string with respect to the entire source file, counting the first character of the file as 0; and the position of the last character in the string plus 1. So, for example, if the first line of the article is "The Endochronic Properties of Resublimated Thiotimoline", the first token, "The", has the span 0..3, and the second, "Endochronic", has the span 4..15.
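This convention (start at 0, end one past the last character) is the same half-open convention used by Python string slicing, so a span can be applied to the source text directly. The following is a minimal sketch; span_text is a hypothetical helper, and it assumes the source file has been read into a string whose character positions match the offsets.

```python
def span_text(source, start, end):
    """Return the substring covered by the span start..end."""
    if not 0 <= start <= end <= len(source):
        raise ValueError(f"span {start}..{end} is outside the source text")
    return source[start:end]


line = "The Endochronic Properties of Resublimated Thiotimoline"
print(span_text(line, 0, 3))    # the first token, "The"
print(span_text(line, 16, 26))  # the third token, "Properties"
```

One convenient consequence is that the length of an annotated string is always end minus start.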
token, part of speech: Token and POS annotation map one-to-one: every token is coterminous with a POS, and vice versa. Some of our tokens contain white space, when it is included between parentheses inside a single chemical word; some are substrings of what would be considered a token in Penn Treebank.
chains (discontinuous entity references): We have developed a way to annotate discontinuous entity references such as
In order to minimize incompatibilities between entity and treebank annotation we currently restrict chaining to a small number of types of construction, which however include most of the cases of discontinuous entity references that we would like to tag.
In many places in the online documentation the text of a chain is presented like this: "CYP1A (+) 2". ("CYP1A1" is not annotated as a chain, being entirely contiguous.) Its span is shown as "[1284..1289, 1291..1292]" in the HTML view files and as "1284.. ..1292" in WordFreak. The treebank files do not recognize chains as such; see the Addendum to the Penn Treebank II Style Bracketing Guidelines: BioMedical Treebank Annotation [pdf] [txt] (Warner et al. 2004)
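To make the chain notation concrete, the sketch below parses a span written in the HTML view style and rebuilds the "(+)" presentation. Both function names are hypothetical, the parser handles only the bracketed HTML view notation (not the WordFreak form), and the sample source text is invented so that the offsets line up with the span quoted above.

```python
import re


def parse_chain_span(notation):
    """Parse e.g. "[1284..1289, 1291..1292]" into [(1284, 1289), (1291, 1292)]."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", notation)]


def chain_text(source, segments, joiner=" (+) "):
    """Join the text of each segment in the style used by the documentation."""
    return joiner.join(source[start:end] for start, end in segments)


# Invented source: 1284 filler characters followed by "CYP1A1/2".
source = "x" * 1284 + "CYP1A1/2"
segments = parse_chain_span("[1284..1289, 1291..1292]")
print(segments)
print(chain_text(source, segments))
```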
Sentences and Sections: The source files contain two kinds of material:
These all need to be treebanked, but to distinguish between them, in sentence annotation we tag every sentence in biomedical text with the label "Sentence", and every sentence elsewhere as a "Section".
Grammatically, a Sentence or Section can be a sentence fragment.
retired entities: A retired entity is one whose label is no longer used in current annotation, but which has been left in at least some of these files because we are in transition to more precise labeling, and because it may be of use. The only such type in these files is Malignancy. This release analyzes malignancies by seven attributes (and we are currently adding more), but originally we instructed our annotators to capture the maximal string describing a malignancy, in what we now call information-gathering mode. It was by analyzing these long strings that we have been able to isolate attributes that are significant to researchers and simple enough to promise automated tagging.
The basic sequence of annotation is
All the annotators except the entity annotators work on both domains.
The pretaggers apply the paragraph and sentence taggers and correct their output as necessary; they also run the tokenizer, without correcting its output. This is a largely mechanical task but does require some care.
The entity annotators are organized in two groups according to domain, CYP and oncology. They are recruited mostly from among students and graduates with some biology or medical background. We try to insulate the other types of annotators as much as possible from the need for domain knowledge.
These are the only annotators who work exclusively in a single domain. The entity definitions are developed specifically for each domain and are not used in the other one; e.g., oncology annotators don't tag measurements, and CYP annotators don't tag genes, and their toolsets don't include these labels. Furthermore, the definitions sometimes overlap or conflict between the domains: in oncology, "K-ras" can be tagged either as a Gene or as a Protein (Gene-gene/RNA, Gene-protein), while in CYP it will be tagged as a Substance if referring to a protein, but ignored when it refers to a gene.
The POS annotators apply the POS tagger and correct its output according to guidelines that incorporate some of the requirements of the entity and treebanking annotators, but minimize requirements for domain knowledge. This task requires familiarity with classical grammar and training in the Penn Treebank rules, as modified in the course of this project.
The treebankers' task requires a strong background in syntactic theory.
Our online annotators' documentation as of the date of this release has been frozen as a reference for this release. The evolving documentation will be kept here.
Our software is open-source and can be downloaded from SourceForge.net:
See our online Publications page for these documents.
Seth Kulick, Ann Bies, Mark Liberman, Mark Mandel, Ryan McDonald, Martha Palmer, Andrew Schein and Lyle Ungar, Integrated Annotation for Biomedical Information Extraction (HLT/NAACL, Boston, May 2004)
An earlier sample of material was released informally in March 2004 on a CD handed out at the BioCreAtIvE workshop, titled simply "UPenn Biomedical Information Extraction Project".
Linguistic Data Consortium, 3600 Market Street, Suite 810, Philadelphia, PA 19104-2653, USA. ldc@ldc.upenn.edu. Telephone: +1 (215) 898-0464. Fax: +1 (215) 573-2175.
Institute for Research in Cognitive Science, 3401 Walnut Street, Suite 400A, Philadelphia, PA 19104, USA. ircs-info@ircs.upenn.edu. Telephone: +1 (215) 898-0357. Fax: +1 (215) 573-9247.
Pretagging Annotators: Matt Leger, Michael Noda
POS Annotators: Dhinakaran Chinappen, Melissa Demian, Benjamin George, Kaira Gui, Justin Lacasse, Alexis Lerro, Mark Manocchio, Brad Moatz, Ben Newman, Jesse Palma, Ariel Richmond, Sarah Stippich, Ryan J. Tracy, Sophia Varghese, Johanna Wright
CYP450 Annotators: Jee Bang, Hareesh Chandrupatla, Robin Golden, Sanipa Koetsawasdi, Sina Neshatian, Nilay Shah, Rachel Choi Swetz, Lakiya Wimbish, Christopher Wright
Oncology Annotators: Avik Basu, Dan Caroff, Jacqueline Ewing, Amy Felix, Nadeene Francesco, Ari Goldberg, Brian Golden, Brett Merves, Karen Rudo, Jonathan Schwartz, Sabrina Sumner, Amanda van Scoyoc, Julie Wang
Treebankers: Christine Brisson, Grace Mrowicki
Programmer Analysts: Jeremy LaCivita, Eric Pancoast
Faculty: Susan Davidson, Aravind Joshi, Mark Liberman, Mitch Marcus, Martha Palmer, Fernando Pereira, Val Tannen, Lyle Ungar
Students, postdoctorates, staff: Ann Bies, Hubert Jin, Seth Kulick, Mark Mandel, Marty McCormick, Ryan McDonald, Tom Morton, Michael Patek, Ted Sandler, Andrew Schein, Rishi Talreja, Alex Vasserman, Peng Wang, Colin Warner, Dalal Zakhary, Ramez Zakhary
HTML View: The HTML View files display the source text as it appears in PubMed's "abstract" format. All entity references are underlined and in red. In addition, as you move the mouse pointer across the text, the annotated spans encompassing its present location -- paragraph, sentence/section, token-part of speech, and entity -- will be shown by background highlighting in different colors and patterns (not supported by Internet Explorer):
The spans and the labels will also display on the status bar at the bottom of the window. This functionality is enabled by default in Internet Explorer, but in Mozilla Firefox you must enable it:
Changes in PubMed: Since beginning our annotation, we have occasionally found discrepancies between the texts as we downloaded them from PubMed and as PubMed now presents them. These have generally consisted of cleanup of textual irregularities, such as "- >" for "->".