| Annotators' home |
Many of our annotated files contain errors of one sort or another. These include, but are not limited to:
- entity errors
- tokenization and POS errors
- sentence and section errors
- internal errors: these occur in Wordfreak's internal storage and the XML annotation files. They are not visible through the Wordfreak interface but can mess things up anyway.
Some of these errors, such as hierarchy errors, can be detected and fixed automatically. Some can be detected but not fixed automatically, such as untokenized text, which needs tokenization and POS labeling, both with human checking. Some can only be flagged as probable errors: a human has to decide whether they are wrong or just unusual and, if wrong, correct them. Some cannot be handled automatically at all. And some actual or possible errors were flagged with comments during previous stages of annotation.
In this, our first systematic fixup effort (October 2004), we're concentrating primarily on the last group, and specifically on discontinuous entity references (such as "organic acids" in the phrase "organic and inorganic acids") in the files that were entity-annotated before we developed the chaining tool. We knew then that we would want to capture these, so we put the relevant information in the comment field, like this:
| text | label | comment |
|---|---|---|
| organic | Substance | ... acids |
| inorganic acids | Substance | (none) |
| acids | Substance | organic... |
Now that we have the chaining tool, we are annotating these discontinuous references properly. To do this, we are looking at all the files that have comments in them. Many of these comments are not related to chains, but as an incidental effort we are also prepared to fix some other kinds of problems. Read on.
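If you want to pick out the chain-related comments from the rest, the convention in the table above (a leading or trailing "..." standing in for the missing part) can be checked mechanically. The following is only a sketch: the function name `looks_like_chain_comment` is made up for illustration, and it assumes annotators followed the ellipsis convention consistently, which may not hold for every file.

```python
def looks_like_chain_comment(comment: str) -> bool:
    """Return True if the comment begins or ends with an ellipsis.

    Assumption: discontinuous references were always marked this way.
    """
    # Treat the single-character ellipsis the same as three periods.
    text = comment.strip().replace("\u2026", "...")
    # A comment made up only of periods says nothing about the missing part.
    if text.strip(". ") == "":
        return False
    return text.startswith("...") or text.endswith("...")


# The three rows from the example table: (text, label, comment).
rows = [
    ("organic", "Substance", "... acids"),
    ("inorganic acids", "Substance", "(none)"),
    ("acids", "Substance", "organic..."),
]
candidates = [text for text, label, comment in rows
              if looks_like_chain_comment(comment)]
```

Anything this check flags is only a candidate; you still need to look at the file to decide whether a chain is really called for.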
Eric has produced lists of all the comments in these fixup files, separated by domain, and we now have web pages to help you read those lists:
| Oncology | CYP450 |
|---|---|
| web page | web page |
| text listings | text listings |
Eric's comment reference files have the same name as the source file with an extra "EXTRACT.txt" extension. Each line of a file gives, in order, the Span, Comment, Type, Annotator, and Text of one commented annotation.
Each of the "text listings" links goes to a directory containing a separate text file of comments for each file in fixup (e.g., source_file_8722_28138.src.ann.EXTRACT.txt), as well as a zipfile of all the text files, which you can download as a whole.
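If you download the text listings and want to go through them with a script, something like the sketch below may help. It keeps the five fields in the order given above, but the field separator is an assumption: this page does not say how the fields are delimited, so the tab default here is a guess that you should check against a real file. The names `CommentRecord` and `read_extract` and the sample line are invented for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class CommentRecord(NamedTuple):
    span: str
    comment: str
    entity_type: str
    annotator: str
    text: str


def read_extract(handle: TextIO, delimiter: str = "\t") -> Iterator[CommentRecord]:
    """Yield one record per well-formed line of an EXTRACT.txt listing.

    Assumption: fields are separated by `delimiter` (tab by default).
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and lines without exactly five fields.
        if len(fields) != 5:
            continue
        yield CommentRecord(*fields)


# Invented sample: the span format and annotator name are placeholders.
sample = "10-17\t... acids\tSubstance\tannotator1\torganic\nmalformed line\n"
records = list(read_extract(io.StringIO(sample)))
```

Skipping malformed lines silently keeps the sketch short; for real use you would probably want to report them instead.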
The web pages contain the same information, organized somewhat differently. Each of the web pages starts with a list of all the files for fixup in this domain, referenced by source ID: source_file_8722_28138.src appears as 8722. Clicking on that number will take you to the appropriate section of the file, where each line in Eric's original output is shown as a single line of the table, in five columns: span, comment text (in italics), entity label (in boldface), annotator name, and entity text (typewriter/monospace font). Each section of the table corresponding to a single source file begins with a colored bar spanning the width of the table showing the name of the annotation extract file, with a link to take you back to the top of the page. The files are in numerical order, so you may find it simpler to just keep going up or down the page.
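The source ID used on the web pages can be read straight off a file name. The sketch below assumes every file follows the `source_file_<ID>_<number>.src` pattern seen in the example above; the function name `source_id` is hypothetical.

```python
import re
from typing import Optional

# Assumed pattern: source_file_<source ID>_<second number>.src, optionally
# followed by further extensions such as .ann.EXTRACT.txt
_SOURCE_NAME = re.compile(r"^source_file_(\d+)_\d+\.src")


def source_id(filename: str) -> Optional[int]:
    """Return the source ID as an integer, or None if the name does not fit."""
    match = _SOURCE_NAME.match(filename)
    return int(match.group(1)) if match else None


id_from_source = source_id("source_file_8722_28138.src")
id_from_extract = source_id("source_file_8722_28138.src.ann.EXTRACT.txt")
```

Returning an integer rather than a string means that sorting by this value gives the same numerical order as the web pages.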
Once you have checked out a bunch of files for fixup, you can look at their comment lists even before you open the files in Wordfreak.
Please let us know if you have questions, and if this helps you find the comments or if it's just a pain in the butt.
- Christopher Wright, chwright@sas.upenn.edu, AIM: chwright11, 973 641-7182
- Hareesh Chandrupatla, hareesh@ccat.sas.upenn.edu
- Nilay Shah, nilay@seas.upenn.edu, AIM: nilaypshah
- Rachel Swetz, rchoi@mail.vet.upenn.edu, 215 386-1562
- Sanipa Koetsawasdi, sanipa@drexel.edu, sanipa@hotmail.com, 215 382-1500 (h)
- Karen Rudo, rudo@seas.upenn.edu, AIM: aylamarguerida
- Nadeene Francesco, nadeene@comcast.net, AIM: nadeenef
- Mark Mandel, Research Administrator, mamandel@ldc.upenn.edu, AIM: AnnotationMark, 215 898-0328
- Dalal Zakhary, CYP Lead Annotator, zakhary@ldc.upenn.edu
- Ramez Zakhary, Oncology Lead Annotator, rzakhary@unagi.cis.upenn.edu, 215 898-1955
- Eric Pancoast, Programmer - Analyst, edp23@seas.upenn.edu
- Seth Kulick, IRCS researcher, skulick@linc.cis.upenn.edu, http://www.cis.upenn.edu/~skulick/
2004-11-09