Welcome
Mining the Bibliome
PennBioIE Release 0.9

http://bioIE.ldc.upenn.edu
Funded by NSF EIA-0205448, Information Technology Research (ITR) program

This CD and the associated Web-accessible materials are Release 0.9 of the Biomedical Information Extraction Project at the University of Pennsylvania. Our data is on this disk; our software and documentation are on the Web; all are described on this disk.

Introduction

The goal of this project is, briefly, to develop qualitatively better methods for information extraction, specifically from biomedical literature. To that end we are annotating texts in two domains of biomedical knowledge:

To date all of our texts are abstracts taken from PubMed, but we are planning to advance to the annotation of full-text articles. They are all annotated for paragraph, sentence, token, part of speech, and biomedical entity; 642 of them (324 cyp, 318 onco) are also syntactically annotated (treebanked).

Multilevel Annotation

Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is considered unchangeable. As a result, annotation practices have sometimes needed to make compromises which in theory would not have been necessary if the earlier annotation had been able to integrate the requirements of the later work.

Such integration is necessary here because of the ambitious scope of this project, involving highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation levels work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. It is possible because of the long term of our grant (five years), and because we are starting with fresh text, applying all levels of annotation ourselves.

Data

This disk contains 2257 annotated texts (1157 onco, 1100 cyp), in several formats:

File Naming

When a text enters the LAW System, the system assigns it a unique source ID number. To distinguish the successive states of its annotation as the text progresses through the various tasks, the system also assigns it a unique history ID in each task.

For example, we selected PubMed abstract #119555 for the CYP domain, downloaded it, saved it as pm00119555.cyp, and loaded it into the LAW System in the CYP Pretagging task. There the LAW System assigned it source ID 1769 and history ID 5364, so the source file was there called source_file_1769_5364.src and the annotation file, once created and checked in, was source_file_1769_5364.src.ann.

This pair of files was then passed along to CYP POS Tagging, pass 1 (this was before we changed the sequence of tasks), where it was assigned history ID 5694, so the pair consisted of (source_file_1769_5694.src, source_file_1769_5694.src.ann). The source file was unchanged, but since WordFreak uses file names to associate source and annotation files, the names must be identical except for the .ann extension.
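The naming scheme in the example above can be sketched in Python. This is an illustration only: the function names and the regular expression are assumptions made here, not part of the LAW System or WordFreak. Only the pattern source_file_<sourceID>_<historyID>.src and the .ann extension come from the text.

```python
import re

# Pattern inferred from the file names quoted in the text; the LAW
# System's own implementation is not shown in this release.
_NAME_RE = re.compile(r"^source_file_(\d+)_(\d+)\.src(\.ann)?$")


def law_file_names(source_id, history_id):
    """Return the (source, annotation) file-name pair for one task state."""
    src = f"source_file_{source_id}_{history_id}.src"
    # The annotation file differs from the source file only by ".ann".
    return src, src + ".ann"


def parse_law_file_name(name):
    """Split a file name into (source_id, history_id, is_annotation)."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a LAW-style file name: {name!r}")
    return int(m.group(1)), int(m.group(2)), m.group(3) is not None
```

Under these assumptions, parsing two file names that share a source ID but differ in history ID identifies them as two states of the same text.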

The LAW System keeps copies of all the states of annotation.

We are providing a bidirectional index for each domain. Each index file lists first all the source IDs in order with their corresponding PMIDs, and then all the PMIDs in order with their corresponding source IDs.
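The idea of a bidirectional index can be sketched as a pair of lookup tables. The layout of the index files themselves is not described here, so this sketch works from in-memory (source ID, PMID) pairs; build_bidirectional_index is a hypothetical name, and the single pair used is the one given in the File Naming example.

```python
def build_bidirectional_index(pairs):
    """Build both lookup directions from (source_id, pmid) pairs."""
    by_source = {}
    by_pmid = {}
    for source_id, pmid in pairs:
        by_source[source_id] = pmid
        by_pmid[pmid] = source_id
    # Sort each direction by its key, mirroring the "in order" listing.
    return dict(sorted(by_source.items())), dict(sorted(by_pmid.items()))


# The one pairing stated in the text: source ID 1769 <-> PMID 119555.
by_source, by_pmid = build_bidirectional_index([(1769, 119555)])
```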

Annotation

Levels of Annotation

All our texts are annotated at the following levels:

and some of them are also syntactically annotated (642 in the current release). Paragraph, sentence, tokenization, POS, and syntactic annotation (Treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual. Originally we used a POS tagger trained on Penn Treebank data, which made many absurd errors on the very different text of these biomedical abstracts. When we had enough manually corrected data to train a tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al., 2004).

Annotation at all levels except Entity is based on the guidelines and principles developed for the Penn Treebank, with many modifications that we have found necessary. Entity definitions come originally from our domain experts and are developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, and all annotations are made in a separate file.

Our definitions and documentation are maintained online for our annotators' use. The development of these definitions and guidelines is an ongoing effort, as our biomedical domain experts refine their definitions of the categories to be annotated, as we venture into different areas of the domain (for example, concentrating on neuroblastoma within the oncology domain), and as our annotators report back from the front lines on what happens when theory (definitions and guidelines) meets data (text).

Notes on Annotation

span, character offset: WordFreak stores the span of each annotated string as two integers: the position of the first character in the string with respect to the entire source file, counting the first character of the file as 0; and the position of the last character in the string plus 1. So, for example, if the first line of the article is "The Endochronic Properties of Resublimated Thiotimoline", the first token, "The", has the span 0..3, and the second, "Endochronic", has the span 4..15.
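The span convention can be illustrated with a short Python sketch. Splitting on whitespace is an assumption made only for this illustration (the project's tokenization differs, as the next note explains), and token_spans is a hypothetical helper, not part of WordFreak.

```python
def token_spans(text):
    """Yield (token, start, end) for whitespace-separated tokens.

    start is the 0-based offset of the token's first character; end is
    the offset of its last character plus 1.
    """
    pos = 0
    for token in text.split():
        # Search from the end of the previous token so that repeated
        # words are located at their own positions.
        start = text.index(token, pos)
        end = start + len(token)
        yield token, start, end
        pos = end


title = "The Endochronic Properties of Resublimated Thiotimoline"
spans = list(token_spans(title))
```

With this convention a Python slice recovers the annotated string directly: title[0:3] is "The".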

token, part of speech: Token and POS annotation map one-to-one: every token is coterminous with a POS, and vice versa. Some of our tokens contain white space, when it is included between parentheses inside a single chemical word; some are substrings of what would be considered a token in Penn Treebank.
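The one-to-one mapping suggests a simple consistency check: the set of token spans and the set of POS spans should be identical. The sketch below assumes that each level has already been read into a list of (start, end) tuples; that representation and the function name are assumptions, not the WordFreak file format.

```python
def unmatched_spans(token_spans, pos_spans):
    """Return the spans that occur in only one of the two levels."""
    # Symmetric difference: spans with a token but no POS, or vice versa.
    return sorted(set(token_spans) ^ set(pos_spans))
```

An empty result means that every token is coterminous with a POS annotation and vice versa.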

chains (discontinuous entity references): We have developed a way to annotate discontinuous entity references such as

We call these discontinuous references chains. In the first example, "CYP1A2" is a chain of two links. At present they are implemented as separately tagged entity references, all bearing the same label -- in this case, "cyp" -- whose connection is recorded in a separate line of XML listing their annotation-ID numbers, which are unique within each annotation file.

In order to minimize incompatibilities between entity and treebank annotation we currently restrict chaining to a small number of types of construction, which however include most of the cases of discontinuous entity references that we would like to tag.

In many places in the online documentation the text of a chain is presented like this: "CYP1A (+) 2". ("CYP1A1" is not annotated as a chain, being entirely contiguous.) Its span is shown as "[1284..1289, 1291..1292]" in the HTML view files and as "1284.. ..1292" in WordFreak. The treebank files do not recognize chains as such; see the Addendum to the Penn Treebank II Style Bracketing Guidelines: BioMedical Treebank Annotation [pdf] [txt] (Warner et al. 2004).
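The three presentations of a chain described above can be reproduced with a small Python sketch. It assumes a chain is held as a list of (start, end) link spans in text order; the function names are hypothetical, and the single-link case in wordfreak_span is an assumption, since the text only shows the multi-link form.

```python
def html_view_span(links):
    """Format a chain's spans as shown in the HTML view files."""
    return "[" + ", ".join(f"{start}..{end}" for start, end in links) + "]"


def wordfreak_span(links):
    """Format a chain's spans in the abbreviated WordFreak style."""
    if len(links) == 1:
        # Assumption: a single contiguous span is shown as start..end.
        start, end = links[0]
        return f"{start}..{end}"
    # Only the start of the first link and the end of the last are shown.
    return f"{links[0][0]}.. ..{links[-1][1]}"


def chain_text(source, links):
    """Join the text of each link with the "(+)" separator."""
    return " (+) ".join(source[start:end] for start, end in links)


links = [(1284, 1289), (1291, 1292)]
```

For the links above, html_view_span(links) gives "[1284..1289, 1291..1292]" and wordfreak_span(links) gives "1284.. ..1292".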

Sentences and Sections: The source files contain two kinds of material:

These all need to be treebanked, but to distinguish between them, in sentence annotation we tag every sentence in biomedical text with the label "Sentence", and every sentence elsewhere as a "Section".

Grammatically, a Sentence or Section can be a sentence fragment.

retired entities: A retired entity is one whose label is no longer used in current annotation, but which has been left in at least some of these files because we are in transition to more precise labeling, and because it may be of use. The only such type in these files is Malignancy. This release analyzes malignancies by seven attributes (and we are currently adding more), but originally we instructed our annotators to capture the maximal string describing a malignancy, in what we now call information-gathering mode. It was by analyzing these long strings that we have been able to isolate attributes that are significant to researchers and simple enough to promise automated tagging.

Treebanking

Our treebanking is based on the Penn Treebank II guidelines, with a number of modifications that currently form the Addendum to the Penn Treebank II Style Bracketing Guidelines: BioMedical Treebank Annotation [pdf] [txt], running to about 65 pages.

Sequence of Annotation

The basic sequence of annotation is

  1. Paragraph
  2. Sentence
  3. Token
  4. Entity
  5. Part of speech
  6. Treebanking

At first we were annotating POS before entities, but the entity annotators sometimes need to change the tokenization, which in turn requires POS correction. The present order allows the POS annotators to be guided by the entity annotators' decisions rather than trying to anticipate them.

All the annotators except the entity annotators work on both domains.

Documentation

Our online annotators' documentation as of the date of this release has been frozen as a reference for this release. The evolving documentation will be kept here.

Software

Our software is open-source and can be downloaded from SourceForge.net:

Publications

See our online Publications page for these documents.

Seth Kulick, Ann Bies, Mark Liberman, Mark Mandel, Ryan McDonald, Martha Palmer, Andrew Schein and Lyle Ungar, Integrated Annotation for Biomedical Information Extraction (HLT/NAACL, Boston, May 2004)

An earlier sample of material was released informally in March 2004 on a CD handed out at the BioCreAtIvE workshop, titled simply "UPenn Biomedical Information Extraction Project".

Institutional Contact

Linguistic Data Consortium
3600 Market Street, Suite 810
Philadelphia, PA 19104-2653, USA
ldc@ldc.upenn.edu
Telephone: +1 (215) 898-0464
Fax: +1 (215) 573-2175

Institute for Research in Cognitive Science
3401 Walnut Street, Suite 400A
Philadelphia, PA 19104
USA
ircs-info@ircs.upenn.edu
Telephone: +1 (215) 898-0357
Fax: +1 (215) 573-9247

Personnel

The following is a list of all the contributors involved in the biomedical information extraction project, "Mining the Bibliome":

University of Pennsylvania

Pretagging Annotators: Matt Leger, Michael Noda

POS Annotators: Dhinakaran Chinappen, Melissa Demian, Benjamin George, Kaira Gui, Justin Lacasse, Alexis Lerro, Mark Manocchio, Brad Moatz, Ben Newman, Jesse Palma, Ariel Richmond, Sarah Stippich, Ryan J. Tracy, Sophia Varghese, Johanna Wright

CYP450 Annotators: Jee Bang, Hareesh Chandrupatla, Robin Golden, Sanipa Koetsawasdi, Sina Neshatian, Nilay Shah, Rachel Choi Swetz, Lakiya Wimbish, Christopher Wright

Oncology Annotators: Avik Basu, Dan Caroff, Jacqueline Ewing, Amy Felix, Nadeene Francesco, Ari Goldberg, Brian Golden, Brett Merves, Karen Rudo, Jonathan Schwartz, Sabrina Sumner, Amanda van Scoyoc, Julie Wang

Treebankers: Christine Brisson, Grace Mrowicki

Programmer Analysts: Jeremy LaCivita, Eric Pancoast

Faculty: Susan Davidson, Aravind Joshi, Mark Liberman, Mitch Marcus, Martha Palmer, Fernando Pereira, Val Tannen, Lyle Ungar

Students, postdoctorates, staff: Ann Bies, Hubert Jin, Seth Kulick, Mark Mandel, Marty McCormick, Ryan McDonald, Tom Morton, Michael Patek, Ted Sandler, Andrew Schein, Rishi Talreja, Alex Vasserman, Peng Wang, Colin Warner, Dalal Zakhary, Ramez Zakhary

GlaxoSmithKline

James A. Butler, Paula Matuszek

Children's Hospital of Philadelphia

Peter White, Yang Jin, Scott Winters, Jessica Kim

Sanger Institute

Sally Bamford, Elisabeth Dawson, Jon Teague, Richard Wooster

Others

Robert Gaizauskas (Sheffield), Jun-ichi Tsujii (Tokyo), Bonnie Webber (Edinburgh)

Notes

HTML View: The HTML View files display the source text as it appears in PubMed's "abstract" format. All entity references are underlined and in red. In addition, as you move the mouse pointer across the text, the annotated spans encompassing its present location -- paragraph, sentence/section, token-part of speech, and entity -- will be shown by background highlighting in different colors and patterns (not supported by Internet Explorer):

Clicking the mouse at any point in the text will open a window listing the annotations in detail. At the top of the window are buttons to display the annotation legend as described above, to control the levels of annotation displayed, and to link to the article in PubMed.

The spans and the labels will also display on the status bar at the bottom of the window. This functionality is enabled by default in Internet Explorer, but in Mozilla Firefox you must enable it:

  1. Open the Tools menu
  2. Click Options
  3. Click Web Features in the left-hand column
  4. Click the "Advanced..." button (not the gear icon labeled "Advanced")
  5. Check "Change status bar text"

Changes in PubMed: Since beginning our annotation, we have occasionally found discrepancies between the texts as we downloaded them from PubMed and as PubMed now presents them. These have generally consisted of cleanup of textual irregularities, such as "- >" for "->".