Email Contact: bioie@ldc.upenn.edu
This material is based upon work supported by the National Science Foundation under Grant No.: EIA-0205448
Mining the Bibliome
Penn BioIE, the biomedical information extraction project at the University of Pennsylvania

PennBioIE Release 1.0

PennBioIE, the biomedical information extraction project at the University of Pennsylvania, announces Release 1.0 of its CYP and Oncology corpora through the Linguistic Data Consortium. The files have been tokenized and their biomedical portions annotated for paragraph, sentence, and part of speech, and for biomedical named entity types relevant to the topic. A subset of the files in each corpus has also been syntactically annotated.

Annotation at all layers except entity is based on the Penn Treebank II guidelines with a number of modifications, many of them subsequently adopted by the Penn Treebank. All annotations are standoff. Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) were applied by automatic taggers and manually corrected; entity annotation was manual. The annotation procedure and format are briefly outlined below.

Annotations in biomedical text
(abstract title and text, but not authors' names, affiliations, etc.):

            sentences   tagged tokens   entities
CYP            11,933         274,167     52,710
Oncology       14,668         330,680     61,058
total          26,601         604,847    113,768

PennBioIE CYP 1.0

The PennBioIE CYP corpus (ISBN 1-58563-498-0) consists of 1100 PubMed abstracts on the inhibition of cytochrome P450 enzymes, comprising approximately 313,000 total words of text. Each file has been tokenized and its biomedical portions (274,000 words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 5 types of biomedical named entity in three categories of interest, in collaboration with GlaxoSmithKline R&D. 324 of the abstracts have also been syntactically annotated.

PennBioIE Oncology 1.0

The PennBioIE Oncology corpus (ISBN 1-58563-490-5) consists of 1414 PubMed abstracts on cancer, concentrating on molecular genetics, and comprises approximately 381,000 total words of text. Each file has been tokenized and its biomedical portions (327,000 words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity, in collaboration with Children's Hospital of Philadelphia. 318 of the abstracts have also been syntactically annotated.

There are two oncology subcorpora:
  • The Sanger subcorpus consists of abstracts of 577 articles selected at the Sanger Institute for mention of oncological named entities, concentrating on variations in a small set of human genes associated with many different types of cancer.
  • The neuroblastoma subcorpus consists of 837 abstracts of articles dealing with this particular type of cancer, selected by colleagues at Children's Hospital of Philadelphia. They do not all concentrate on genetics, but they mention a much larger number of genes than the Sanger files do.

Related Projects

JULIE Lab - NLP Tool Suite

The JULIE Lab at Jena University offers several NLP tools, including a named entity tagger (JNET) trained on PennBioIE oncology data, and a sentence splitter (JSBD) and a tokenizer (JTBD) based in part on PennBioIE data.

FABLE

FABLE (Fast Automated Biomedical Literature Extraction) is an online application developed at The Children's Hospital of Philadelphia to mine biomedical literature for information about human genes and proteins, using data developed by the PennBioIE project. FABLE enables users to search MEDLINE® and PubMed®, more effectively than PubMed's own search engine, for

  • articles mentioning specific genes or proteins, even if the article uses a different name than the one specified in the search (Article Finder)
  • sets of genes that are mentioned in articles containing one or more keyword search terms of interest, singly or in arbitrary Boolean combinations, and to view a list of all known synonyms for each gene (Gene Lister)

New features in FABLE Version 3.0:

  • Literature-enabled local mirror of the UCSC Genome Browser
  • New "Did you mean" feature for Gene Lister: when a query returns no results, alternate spellings are suggested
  • Faster searches

PennBioIE Release 0.9

The earlier, interim version of our corpus, Release 0.9, remains available for unrestricted free download.

We have developed a number of taggers from these manual annotations, resulting in several publications.

Innovations

We believe that we have taken the integration of domain knowledge, entity annotation, and syntactic annotation to a new level with this project. We have also devised and implemented a way to annotate discontinuous entities, which we call "chains", such as "sulfuric acid" in "sulfuric or hydrochloric acid".

Outline of Entity Annotation File Format

All our annotation is standoff, so that an "annotated file" generally comprises two files:

  • the unmodified source text
  • an annotation file referring to the source text by character offsets
    (e.g., span="0..10" for the first token of a file beginning "Antimicrob Agents Chemother.")
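
As a minimal sketch of how such an offset reference can be resolved against the source text: the helper names parse_span and span_text are hypothetical, and the sketch assumes the end offset is exclusive, which is consistent with span="0..10" covering the ten characters of "Antimicrob".

```python
def parse_span(span: str) -> tuple[int, int]:
    """Split a span such as "0..10" into integer start and end offsets."""
    start, end = span.split("..")
    return int(start), int(end)


def span_text(source: str, span: str) -> str:
    """Return the source substring a span refers to.

    Assumption: the end offset is exclusive.
    """
    start, end = parse_span(span)
    return source[start:end]


source = "Antimicrob Agents Chemother."
print(span_text(source, "0..10"))   # Antimicrob
print(span_text(source, "11..17"))  # Agents
```

Because the source text is never modified, any number of annotation layers can point into the same file without interfering with one another.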

Our tool for paragraph, sentence, entity, and POS annotation -- everything except Treebanking -- is WordFreak; the code, documentation, and output format are available in open source through SourceForge. Its annotation files have the suffix ".ann". The following sections of our annotators' documentation describe the format from the annotator's point of view:

  • Pre-tagging, as we call it: paragraphs, sentences (including "sections" -- non-biomedical text), and tokens. Much of the contents of this document is repeated in the other documents in this list, but not all of it, as they are aimed at different groups of annotators.
  • Entity annotation general principles
  • Tag-within-tag: Entity tags embedded within other entity tags
  • POS annotation basics
  • Discontinuous entity references ("chains")

Briefly, the .ann file is an XML file most of whose contents consist of the following nested components:

  • file
    • paragraph
      • sentence or section
        • optionally, entity, whose span may contain the spans of other entities
          • POS-tagged token

The annotations have sequential ID numbers, starting with 1, in the order of their beginning tags. Every annotation is labeled with a type: file, paragraph, sentence, section, or the name of one of our entities or parts of speech.

Every non-whitespace character must be contained in exactly one of each of these elements, except for the entity level, about which more below. There is only one file component per .ann file. Sentence and section are structurally equivalent, but we use the section tag for non-biomedical text such as authors' names and affiliations.
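
The coverage rule for one level (for example, all tokens) could be checked roughly as follows. This is only a sketch: the function name coverage_errors and the representation of spans as (start, end) pairs with an exclusive end are assumptions, not part of the .ann format, and the spans are assumed to lie within the source text.

```python
def coverage_errors(source: str, spans: list[tuple[int, int]]) -> list[int]:
    """Return offsets of non-whitespace characters that are not covered
    by exactly one of the given spans (all spans belong to one level)."""
    counts = [0] * len(source)
    for start, end in spans:
        for i in range(start, end):
            counts[i] += 1
    return [
        i for i, ch in enumerate(source)
        if not ch.isspace() and counts[i] != 1
    ]


source = "CYP3A4 was inhibited."
tokens = [(0, 6), (7, 10), (11, 20), (20, 21)]  # hypothetical token spans

print(coverage_errors(source, tokens))             # []
print(coverage_errors(source, tokens[:-1]))        # [20]  (final period uncovered)
print(coverage_errors(source, tokens + [(0, 3)]))  # [0, 1, 2]  (covered twice)
```

The same check applies to the paragraph level and to the sentence/section level, but not to entities, which are optional.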

Within sections we do not check tokenization or POS tags, and we do not tag entities; we only make sure that every non-whitespace character is nested correctly. You may find a "token" tag in lieu of a POS tag, most likely when an entity annotator has had to change the tokenization of a string of text and the POS annotators have missed that token in cleaning up.

Entities apply to strings of tokens, never parts of tokens, which is why an entity annotator may have to change the results of the automatic tokenization. We generally avoid nested entities, but in certain cases we allow or require them. Three-deep nesting --
     [entity [entity [entity]]]
-- is theoretically allowed, but there are probably no cases of it, at least in this release. Nested entities are not necessarily coextensive; the inner one is usually a proper substring of the outer.
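
To look for deep nesting in a set of entity spans, something like the following sketch could be used. The function name max_nesting_depth is hypothetical, spans are assumed to be (start, end) pairs with an exclusive end, and entities are assumed to nest properly, never crossing each other.

```python
def max_nesting_depth(spans: list[tuple[int, int]]) -> int:
    """Return the deepest level of entity-within-entity nesting.

    Coextensive spans count as nested. Raises ValueError if two
    spans cross instead of nesting.
    """
    # Outer spans first: earlier start, and for equal starts the longer span.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    open_ends: list[int] = []  # end offsets of the currently enclosing spans
    deepest = 0
    for start, end in ordered:
        # Close every span that ends at or before this one begins.
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()
        if open_ends and end > open_ends[-1]:
            raise ValueError(f"span {start}..{end} crosses an enclosing span")
        open_ends.append(end)
        deepest = max(deepest, len(open_ends))
    return deepest


# Hypothetical entity spans: (10, 14) lies inside (10, 20), which lies inside (0, 20).
print(max_nesting_depth([(0, 20), (0, 6), (10, 20), (10, 14)]))  # 3
print(max_nesting_depth([(0, 6), (7, 10)]))                      # 1
```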

At the end of the .ann file there may be two more types of component, chains and relations (only chains in the Release 0.9 data). A chain is our way of annotating a discontinuous string. We use chaining only for entity references, not for any other level, and we restrict it quite tightly (see the annotators' documentation on discontinuous entity references listed above). A chain consists of two or more entity annotations of the same type, identified by their ID numbers. Their order in the chain is not supposed to be distinctive, but we have found that which one is tagged first may have some effect on subsequent annotations, so it is possible that some of the chains in the XML files list the ID numbers in non-sequential order.
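
As an illustration only, the text of a chained entity can be reassembled from its member annotations as sketched below. The data layout (a dict from annotation ID to type and span), the ID numbers, the entity type name "substance", and the function name chain_text are all hypothetical; they do not reproduce the actual .ann representation. Members are sorted by offset, so the order of IDs within the chain does not matter.

```python
Span = tuple[int, int]


def chain_text(source: str,
               entities: dict[int, tuple[str, Span]],
               chain_ids: list[int]) -> str:
    """Join the pieces of a discontinuous entity in text order."""
    if len(chain_ids) < 2:
        raise ValueError("a chain needs two or more entity annotations")
    members = [entities[i] for i in chain_ids]
    if len({etype for etype, _ in members}) != 1:
        raise ValueError("all members of a chain must share one entity type")
    pieces = sorted(span for _, span in members)
    return " ".join(source[start:end] for start, end in pieces)


source = "sulfuric or hydrochloric acid"
entities = {
    1: ("substance", (0, 8)),    # "sulfuric"
    2: ("substance", (25, 29)),  # "acid"
}

# The IDs are listed out of sequence here on purpose.
print(chain_text(source, entities, [2, 1]))  # sulfuric acid
```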


2009-02-09

Mark A. Mandel, Research Administrator, PennBioIE
Linguistic Data Consortium, University of Pennsylvania