Email Contact: bioie@ldc.upenn.edu
This material is based upon work supported by the National Science Foundation under Grant No.: EIA-0205448
Mining the Bibliome
Penn BioIE, the biomedical information extraction project at the University of Pennsylvania

JULIE Lab - NLP Tool Suite

The JULIE Lab at Jena University offers several NLP tools, including a named entity tagger (JNET) trained on PennBioIE oncology data, as well as a sentence splitter (JSBD) and a tokenizer (JTBD) based in part on PennBioIE data.

FABLE

FABLE (Fast Automated Biomedical Literature Extraction) is an online application developed at The Children's Hospital of Philadelphia to mine biomedical literature for information about human genes and proteins, using data developed by the PennBioIE project. FABLE enables users to search MEDLINE® and PubMed®, more effectively than PubMed's own search engine, for

  • articles mentioning specific genes or proteins, even if the article uses a different name than the one specified in the search (Article Finder)
  • sets of genes mentioned in articles that contain one or more keyword search terms of interest, singly or in arbitrary Boolean combinations, together with a list of all known synonyms for each gene (Gene Lister)

New features in FABLE Version 3.0:

  • Literature-enabled local mirror of the UCSC Genome Browser
  • New "Did you mean" feature for Gene Lister: when a query returns no results, alternate spellings are suggested
  • Faster searches

PennBioIE Release 0.9

PennBioIE, the biomedical information extraction project at the University of Pennsylvania, announces the Web availability of Release 0.9 of its data for unrestricted free download.

The corpus consists of 2258 Medline abstracts in two domains:

  • the molecular genetics of oncology (1158 abstracts)
  • the inhibition of enzymes of the CYP450 class (1100 abstracts)

All of the abstracts have been manually annotated for paragraphs, sentences, part of speech, and a set of biomedical entity types defined for this project and specific to each domain. Our entity definitions have been developed by domain experts from GSK (for CYP450) and Children's Hospital of Philadelphia (for oncology).

In addition, 642 of the abstracts have been syntactically annotated (318 oncology, 324 CYP). The entity annotations are consistent with the syntactic annotation to the fullest extent possible. The part of speech and syntactic annotation are in line with Penn Treebank guidelines, with modifications and additions as necessary.

We have developed a number of taggers from these manual annotations, resulting in several publications.

Innovations

We believe that we have taken the integration of domain knowledge, entity annotation, and syntactic annotation to a new level with this project. We have also devised and implemented a way to annotate discontinuous entities, which we call "chains", such as "sulfuric acid" in "sulfuric or hydrochloric acid".

    I. "Data" section of the website:

    II. Tools

Outline of Entity Annotation File Format

All our annotation is standoff, so that an "annotated file" generally comprises two files:

  • the unmodified source text
  • an annotation file referring to the source text by character offsets
    (span="0..10" for the first token of a file beginning "Antimicrob Agents Chemother.")
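
As an illustration of how a standoff span points back into the source text, here is a minimal Python sketch. It assumes that a span is written as "start..end" with a zero-based start and an exclusive end, which is consistent with the "0..10" example above ("Antimicrob" is ten characters long). The function names are hypothetical and are not part of WordFreak.

```python
def parse_span(span):
    """Split a 'start..end' string into two integer offsets."""
    start, end = span.split("..")
    return int(start), int(end)


def resolve_span(source_text, span):
    """Return the substring of the source text that a span refers to.

    Assumption: start is inclusive and end is exclusive.
    """
    start, end = parse_span(span)
    if not 0 <= start <= end <= len(source_text):
        raise ValueError(f"span {span!r} falls outside the source text")
    return source_text[start:end]


source_text = "Antimicrob Agents Chemother."
print(resolve_span(source_text, "0..10"))  # Antimicrob
```

Because the source text is never modified, any number of annotation layers can refer to the same offsets without interfering with one another.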

Our tool for paragraph, sentence, entity, and POS annotation -- everything except Treebanking -- is WordFreak; the code, documentation, and output format are available in open source through SourceForge. Its annotation files have the suffix ".ann". The following sections of our annotators' documentation describe the format from the annotator's point of view:

  • Pre-tagging, as we call it: paragraphs, sentences (including "sections" -- non-biomedical text), and tokens. Much of the contents of this document is repeated in the other documents in this list, but not all of it, as they are aimed at different groups of annotators.
  • Entity annotation general principles
  • Tag-within-tag: Entity tags embedded within other entity tags
  • POS annotation basics
  • Discontinuous entity references ("chains")

Briefly, the .ann file is an XML file most of whose contents consist of the following nested components:

  • file
    • paragraph
      • sentence or section
        • optionally, entity, possibly comprising the span of other entities as substrings
          • POS-tagged token

The annotations have sequential ID numbers, starting with 1, in the order of their beginning tags. Every annotation is labeled with a type: file, paragraph, sentence, section, or the name of one of our entities or parts of speech.

Every non-whitespace character must be contained in exactly one of each of these elements, except for the entity level, about which more below. There is only one file component per .ann file. Sentence and section are structurally equivalent, but we use the section tag for non-biomedical text such as authors' names and affiliations.
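
The coverage rule can be checked mechanically. The following Python sketch is only an illustration, not our validation tooling: it assumes that the annotations of one level (for example, all sentences and sections) have already been read into a list of (start, end) offset pairs with an exclusive end, and that every span lies within the text. It reports each non-whitespace character that is not covered exactly once.

```python
def coverage_errors(text, spans):
    """Return the offsets of non-whitespace characters that are not
    covered by exactly one span of the given level.

    Assumptions: spans are (start, end) pairs, end is exclusive,
    and every span lies within the text.
    """
    counts = [0] * len(text)
    for start, end in spans:
        for i in range(start, end):
            counts[i] += 1
    return [
        i for i, ch in enumerate(text)
        if not ch.isspace() and counts[i] != 1
    ]


text = "Antimicrob Agents Chemother."
# One span covering the whole line: no errors.
print(coverage_errors(text, [(0, 28)]))            # []
# "Agents" (offsets 11 to 16) is left uncovered here.
print(coverage_errors(text, [(0, 10), (18, 28)]))  # [11, 12, 13, 14, 15, 16]
```

The same check would be run once per level (paragraph, sentence or section, token), but not for entities, which are optional.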

Within sections we do not check tokenization or POS tags, and we do not tag entities; we only make sure that every non-whitespace character is nested correctly. You may find a "token" tag in lieu of a POS tag, most likely where an entity annotator had to change the tokenization of a string of text and the POS annotators missed that token in cleaning up.

Entities apply to strings of tokens, never parts of tokens, which is why an entity annotator may have to change the results of the automatic tokenization. We generally avoid nested entities, but in certain cases we allow or require them. Three-deep nesting --
     [entity [entity [entity]]]
-- is theoretically allowed, but there are probably no cases of it, at least in this release. Nested entities are not necessarily coextensive; the inner one is usually a proper substring of the outer.
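
The two constraints on entities (boundaries fall on token boundaries, and nesting is at most three deep) can be sketched in Python as follows. This is an illustration under stated assumptions: tokens and entities are given as (start, end) offset pairs with an exclusive end, and the example entity spans are invented for demonstration rather than taken from the corpus.

```python
def aligns_with_tokens(entity, tokens):
    """True if the entity starts where some token starts and ends
    where some token ends, so that it never splits a token."""
    starts = {start for start, _ in tokens}
    ends = {end for _, end in tokens}
    return entity[0] in starts and entity[1] in ends


def max_nesting_depth(entities):
    """Return the deepest entity nesting: 1 for an unnested entity,
    2 for an entity inside one other entity, and so on."""
    deepest = 0
    for i, inner in enumerate(entities):
        enclosing = sum(
            1 for j, outer in enumerate(entities)
            if j != i and outer[0] <= inner[0] and inner[1] <= outer[1]
        )
        deepest = max(deepest, 1 + enclosing)
    return deepest


text = "sulfuric or hydrochloric acid"
tokens = [(0, 8), (9, 11), (12, 24), (25, 29)]

print(aligns_with_tokens((12, 29), tokens))      # True: "hydrochloric acid"
print(aligns_with_tokens((12, 20), tokens))      # False: cuts a token in half
print(max_nesting_depth([(12, 29), (25, 29)]))   # 2
```

Note that two coextensive entities each count as enclosing the other in this sketch, so both are reported at depth 2.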

At the end of the .ann file there may be two more types of component, chains and relations (only chains in the release 0.9 data). A chain is our way of annotating a discontinuous string. We use chaining only for entity references, not for any other level, and we restrict it quite tightly (ibid.). A chain consists of two or more entity annotations of the same type, identified by their ID numbers. Their order in the chain is not supposed to be distinctive, but we have found that which one is tagged first may have some effect on subsequent annotations, so it is possible that some of the chains in the XML files list the ID numbers in non-sequential order.
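
To make the chain idea concrete, here is a small Python sketch that reassembles the discontinuous entity "sulfuric acid" from "sulfuric or hydrochloric acid". The ID numbers, the entity type name "substance", and the in-memory layout (a dictionary from ID to type and offsets) are assumptions made for the example; they do not reproduce the XML of the .ann file. Since the order of IDs in a chain is not meant to be significant, the sketch sorts the members by their start offset before joining them.

```python
def chain_text(source_text, annotations, chain_ids):
    """Rebuild the text of a discontinuous entity from its chain.

    annotations maps an ID number to (type, start, end), end exclusive.
    chain_ids lists the member IDs in any order.
    """
    if len(chain_ids) < 2:
        raise ValueError("a chain needs two or more entity annotations")
    members = [annotations[ann_id] for ann_id in chain_ids]
    if len({ann_type for ann_type, _, _ in members}) != 1:
        raise ValueError("all members of a chain must have the same type")
    # Chain order is not significant, so restore text order by offset.
    ordered = sorted(members, key=lambda member: member[1])
    return " ".join(source_text[start:end] for _, start, end in ordered)


source_text = "sulfuric or hydrochloric acid"
annotations = {
    1: ("substance", 0, 8),    # "sulfuric" (hypothetical ID and type)
    3: ("substance", 25, 29),  # "acid"
}
# The IDs are deliberately listed out of sequence.
print(chain_text(source_text, annotations, [3, 1]))  # sulfuric acid
```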

2005-07-15

==================================================================================

-- Mark A. Mandel, Research Administrator, PennBioIE
Linguistic Data Consortium, University of Pennsylvania