Biomedical annotation fixup

Annotators' home


The Problem

Many of our annotated files contain errors of one sort or another.

Some of these errors, such as hierarchy errors, can be detected and fixed automatically. Some can be detected but not fixed automatically, such as untokenized text, which needs tokenization and POS labeling, both with human checking. Some can only be flagged as probable errors, needing a human's eye to determine whether they are wrong or just unusual and, if wrong, to correct them. Some cannot be handled automatically at all. And some actual or possible errors have been flagged with comments during previous stages of annotation.

What We're Doing Now

In this, our first systematic fixup effort (October 2004), we're concentrating primarily on the last group, and specifically on discontinuous entity references (such as "organic acids" in the phrase "organic and inorganic acids") in the files that were entity-annotated before we developed the chaining tool. We knew then that we would want to capture these, so we put the relevant information in the comment field, like this:

text             label      comment
organic          Substance  ... acids
inorganic acids  Substance  (none)
acids            Substance  organic...
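
To make the comment convention in the example above concrete, here is a minimal Python sketch of how such comments could be read back into full entity text. It is only an illustration and not part of the chaining tool: the `Entity` type and the `reconstruct` function are hypothetical names, and it assumes that a comment starting with "..." names the piece that follows the annotated text, while a comment ending with "..." names the piece that precedes it.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """One annotated entity (hypothetical structure for illustration)."""
    text: str
    label: str
    comment: str  # empty string when there is no comment


def reconstruct(entity: Entity) -> Optional[str]:
    """Return the full discontinuous reference hinted at by the comment.

    Returns None when the comment does not follow the "..." convention.
    """
    comment = entity.comment.strip()
    if comment.startswith("..."):
        # "... acids" on "organic": the missing piece follows the text.
        rest = comment[3:].strip()
        return f"{entity.text} {rest}" if rest else None
    if comment.endswith("..."):
        # "organic..." on "acids": the missing piece precedes the text.
        rest = comment[:-3].strip()
        return f"{rest} {entity.text}" if rest else None
    return None


rows = [
    Entity("organic", "Substance", "... acids"),
    Entity("inorganic acids", "Substance", ""),
    Entity("acids", "Substance", "organic..."),
]

for row in rows:
    print(f"{row.text!r} -> {reconstruct(row)!r}")
```

Both commented rows resolve to "organic acids", which is the discontinuous reference that needs a chain; the uncommented row is left alone. Real comments are free text, so anything this sketch picks out would still need a human check.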

Now that we have the chaining tool, we are annotating these discontinuous references properly. To do this, we are looking at all the files that have comments in them. Many of these comments are not related to chains, but as an incidental effort we are also prepared to fix some other kinds of problems. Read on.

Looking at comments

Eric has produced lists of all the comments in these fixup files, separated by domain, and we now have web pages to help you read those lists:

Oncology       CYP450
web page       web page
text listings  text listings

Eric's comment reference files have the same name as the source file with an extra "EXTRACT.txt" extension. Each line of a file gives the Span, Comment, Type, Annotator, and Text of one comment, in that order.

Each of the "text listings" links goes to a directory containing a separate text file of comments for each file in fixup (e.g., source_file_8722_28138.src.ann.EXTRACT.txt), as well as a zipfile of all the text files, which you can download as a whole.
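
If you would rather filter the text listings with a script than read them by eye, something like the following Python sketch may help. It rests on assumptions that this page does not confirm: that each line of an EXTRACT.txt file holds the five fields (Span, Comment, Type, Annotator, Text) separated by tabs, and that there is no header line. The names `CommentRecord` and `read_extract` are made up for this example, and the sample lines are invented, not taken from a real file.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class CommentRecord(NamedTuple):
    """One line of an extract file (field order as described above)."""
    span: str
    comment: str
    entity_type: str
    annotator: str
    text: str


def read_extract(lines: Iterable[str]) -> Iterator[CommentRecord]:
    """Yield one record per well-formed line.

    ASSUMPTION: fields are tab-separated; the real delimiter may differ.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        yield CommentRecord(*fields)


# Invented sample content, standing in for an open EXTRACT.txt file.
SAMPLE = (
    "12-19\t... acids\tSubstance\tannotator1\torganic\n"
    "\n"
    "40-55\tcheck label\tSubstance\tannotator2\tinorganic acids\n"
)

records = list(read_extract(io.StringIO(SAMPLE)))

# Comments containing "..." are the likeliest candidates for chaining.
chain_candidates = [r for r in records if "..." in r.comment]
for record in chain_candidates:
    print(record.span, record.text, record.comment)
```

With a real file, pass an open file object to `read_extract` in place of the `io.StringIO` wrapper, and adjust the delimiter if the files turn out not to be tab-separated.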

The web pages contain the same information, organized somewhat differently. Each of the web pages starts with a list of all the files for fixup in this domain, referenced by source ID: source_file_8722_28138.src appears as 8722. Clicking on that number will take you to the appropriate section of the file, where each line in Eric's original output is shown as a single line of the table, in five columns: span, comment text (in italics), entity label (in boldface), annotator name, and entity text (typewriter/monospace font). Each section of the table corresponding to a single source file begins with a colored bar spanning the width of the table showing the name of the annotation extract file, with a link to take you back to the top of the page. The files are in numerical order, so you may find it simpler to just keep going up or down the page.
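
The source ID used on the web pages can also be pulled out of a file name mechanically. The Python sketch below generalizes from the single example above (source_file_8722_28138.src appears as 8722), so the pattern is an assumption: it takes the first number after "source_file_" as the ID. The function name `source_id` and the second file name in the demonstration are hypothetical.

```python
import re
from typing import Optional

# ASSUMPTION: names look like source_file_<ID>_<number>.src[...],
# generalized from the one example given on this page.
_SOURCE_ID = re.compile(r"^source_file_(\d+)_\d+\.src")


def source_id(filename: str) -> Optional[int]:
    """Return the numeric source ID, or None if the name does not match."""
    match = _SOURCE_ID.match(filename)
    return int(match.group(1)) if match else None


names = [
    "source_file_8722_28138.src.ann.EXTRACT.txt",
    "source_file_901_1.src.ann.EXTRACT.txt",  # hypothetical file name
]

# Sort by numeric ID, matching the numerical order of the web pages;
# a plain alphabetical sort would put 8722 before 901.
for name in sorted(names, key=lambda n: source_id(n) or 0):
    print(source_id(name), name)
```

Sorting by the integer ID rather than by the raw name keeps a local directory listing in the same order as the sections of the web page.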

Once you have checked out a bunch of files for fixup, you can look at their comment lists even before you open the files in Wordfreak.

Please let us know if you have questions, and if this helps you find the comments or if it's just a pain in the butt.

Contact

Mailing list

Individual contact information

CYP entity annotators

Christopher Wright
chwright@sas.upenn.edu
AIM: chwright11
973 641-7182

Hareesh Chandrupatla
hareesh@ccat.sas.upenn.edu

Nilay Shah
nilay@seas.upenn.edu
AIM: nilaypshah

Rachel Swetz
rchoi@mail.vet.upenn.edu
215 386-1562

Sanipa Koetsawasdi
sanipa@drexel.edu
sanipa@hotmail.com
215 382-1500 (h)

Oncology entity annotators (pass 1 only)

Karen Rudo
rudo@seas.upenn.edu
AIM: aylamarguerida

Nadeene Francesco
nadeene@comcast.net
AIM: nadeenef

Others

Mark Mandel, Research Administrator
mamandel@ldc.upenn.edu
AIM: AnnotationMark
215 898-0328

Dalal Zakhary, CYP Lead Annotator 
zakhary@ldc.upenn.edu

Ramez Zakhary, Oncology Lead Annotator
rzakhary@unagi.cis.upenn.edu 
215 898-1955

Eric Pancoast, Programmer - Analyst
edp23@seas.upenn.edu

Seth Kulick, IRCS researcher
skulick@linc.cis.upenn.edu
http://www.cis.upenn.edu/~skulick/



2004-11-09