Pretagging
From BioIE Wiki
Introduction
This guide assumes you know how to use WordFreak. There's a User's Guide and a how-to page, but you should have been shown as well. In this text the words "Paragraph", "Sentence", and "Section" are capitalized when referring to the units in the pretagging operation, as distinct from what we might call a paragraph, sentence, or section in ordinary usage. Most times they come out the same, but sometimes the difference matters. You'll see.
Pretagging consists of running a series of automatic taggers within WordFreak and, after some of them, checking their output and correcting it where necessary. While it would be possible to do all the work of these taggers by hand, it would be very costly. The training for tokenization and part-of-speech annotation takes a long time, so pretaggers (i.e., pretagging annotators), who generally don't have that expertise, do not check the work of those taggers.
Each kind of pretagging begins by loading and running the appropriate tagger within WordFreak.
Basic Principles
Levels of Pretagging
We pretag the elementary units of a text at three levels:
- Paragraph
- Sentence (including Section, as explained below)
- Token
End-users and annotators will be familiar with the first two from everyday usage, but there are some subtle differences that will be explained below. A "token" is usually a single word or punctuation mark.
Nesting of Units
The following principles govern the nesting of units:
- Every non-whitespace character in the file must be in one and only one Token.
- Every Token must be in one and only one Sentence or Section. (The difference between a Sentence and a Section is explained below.)
- Every Sentence or Section must be in one and only one Paragraph.
If you think of the larger units as the parents of the next-smaller ones, there shouldn't be any orphans. Since we tag the largest units first and it is difficult to go back and fix a larger unit after running a tagger at a smaller scale, care is required at each stage.
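The second and third nesting rules can be checked mechanically. The following is a minimal Python sketch, not part of WordFreak: it assumes each unit is available as a hypothetical (start, end) pair of character offsets with an exclusive end, and the function name `orphans` is invented for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets, end exclusive


def orphans(children: List[Span], parents: List[Span]) -> List[Span]:
    """Return child spans that are not contained in exactly one parent span."""
    bad = []
    for c_start, c_end in children:
        # Count the parents that fully enclose this child.
        n = sum(1 for p_start, p_end in parents
                if p_start <= c_start and c_end <= p_end)
        if n != 1:
            bad.append((c_start, c_end))
    return bad


# Text "Hi there." with three Tokens: "Hi", "there", "."
tokens = [(0, 2), (3, 8), (8, 9)]
sentences = [(0, 2)]               # Sentence wrongly ends after "Hi"
print(orphans(tokens, sentences))  # [(3, 8), (8, 9)]
```

Calling the same function with Sentences/Sections as children and Paragraphs as parents covers the third rule. An empty result means no orphans at that level.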
Biomedical Text and Otherwise
In this project we are working on extracting biomedical information. Not all of the content of our files is biomedical, and we don't want to devote unnecessary effort to annotating the rest, but for various reasons we can't just ignore it. So which is which?
Obviously, the text of the abstract is biomedical text. Somewhat less obviously, we consider the title as biomedical text. The rest is not: citation data, authors, affiliations, PMID (PubMed identification number), and other procedural insertions:
| citation (bibliographic) information | NON-BIO |
| title | BIO |
| authors | NON-BIO |
| authors' institutional affiliations | NON-BIO |
| body | BIO |
| PubMed ID | NON-BIO |
Occasionally you'll find other types of paragraph as well, usually non-biomedical, and often in an indented list format. These are discussed further in the sections on paragraph and sentence tagging. Even if they contain biomedical terms, such as a list of keywords, they are not part of the abstract title or body, so for our purposes they don't count as biomedical text. In addition, non-biomedical material can be found in the text body or title. See Non-biomedical Material Embedded in Text below.
Paragraph Tagging
Use the Paragraph tagger.
The logical divisions of the PubMed abstracts usually look like paragraphs: blocks of text separated by blank lines (but watch out for false breaks). Tag all such blank-line-delimited blocks of text as separate Paragraphs, even in non-biomedical parts of files. For example, make sure the first and second paragraphs are separate; the Paragraph tagger often combines them. Sometimes the tagger misses the final period, right square bracket, or other final character of the Paragraph; check for that and fix it if necessary.
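To make the blank-line rule concrete, here is a rough Python sketch of how the expected Paragraph boundaries could be computed for comparison with the tagger's output. It is not how the Paragraph tagger works internally; the function name and the (start, end) offset representation are assumptions. It cannot tell a false break from a real one, so it does not replace checking by eye.

```python
from typing import List, Tuple


def blank_line_blocks(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of blocks separated by blank lines.

    Each block runs from its first to its last non-whitespace character,
    so a final period or right square bracket is always included.
    """
    blocks = []
    start = None
    end = None
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                # First non-whitespace character of a new block.
                start = offset + (len(line) - len(line.lstrip()))
            # Last non-whitespace character seen so far in this block.
            end = offset + len(line.rstrip())
        elif start is not None:
            # A blank line closes the current block.
            blocks.append((start, end))
            start = None
        offset += len(line)
    if start is not None:
        blocks.append((start, end))
    return blocks
```

Comparing these offsets with the Paragraph spans produced by the tagger would point out merged paragraphs and missed final characters.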
False Paragraph Breaks
False line breaks can occur within paragraphs: sometimes a blank line gets embedded in the middle of a paragraph of continuous text, as
in this case, where I did it on purpose. False breaks are easy to overlook (you may find it helpful to increase the font size). Don't start a new Paragraph at a false break; instead, include the empty line within the Paragraph, and in the Chooser window enter "false break" in the Comment field for the Paragraph.
Real Paragraph Breaks
Sometimes a logical division of the file, such as the abstract body, contains a blank line that really is a paragraph break: it doesn't come in the middle of a word or sentence. Make sure the blank line is not just an artifact of the view in WordFreak by making the window a little wider. If the blank line is really in the text, tag two Paragraphs.
List Paragraphs
Conversely, some abstracts contain a list of short items, each on its own line but not separated by blank lines. These are usually not biomedical in content:
1.
Erratum in:
Hum Pathol 2002 Mar;33(3):379
2.
Comment in:
Am J Surg Pathol. 2002 Mar;26(3):396, discussion 396.
Am J Surg Pathol. 2002 Mar;26(3):396-7; discussion 397-8.
3.
Publication Types:
Review
Review of Reported Cases
As usual, treat everything between one blank line and the next as a single Paragraph. So each of the above numbered examples forms a single Paragraph. (Within each list, the heading and each line in the list should be tagged as a separate Sentence or Section; discussion below.) If you do find such lists in biomedical text (title and body), treat them the same way.
Sentence/Section Tagging
Use the Bio Sentence tagger.
The labels Sentence and Section are left over from some earlier project, and nobody remembers what they were meant for. We're hijacking them to make a useful and needed distinction:
- Biomedical text is tagged with Sentences.
- Non-biomedical text is tagged with Sections.
What's more, we don't care about the division into Sections, or even into Tokens, in the non-biomedical text, as long as proper nesting is maintained -- especially the first nesting rule, that every non-whitespace character must be in one and only one Token. Applying the Sentence/Section distinction to the list of biomedical and non-biomedical parts of the file, above, means that the (small-s!) sentences in the Paragraphs of a typical PubMed abstract are tagged as follows:
- citation (bibliographic) information: Section
- title: Sentence(s)
- authors: Section
- authors' institutional affiliations: Section
- body: Sentences
- PubMed ID: Section
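The mapping above is small enough to write down as a lookup table. This Python sketch is only an illustration: the part names used as keys are hypothetical labels, not anything WordFreak defines, and it does not handle non-biomedical material embedded in the title or body.

```python
# Hypothetical part names; the BIO / NON-BIO split follows the table in this guide.
IS_BIOMEDICAL = {
    "citation": False,
    "title": True,
    "authors": False,
    "affiliations": False,
    "body": True,
    "pubmed_id": False,
}


def unit_label(part: str) -> str:
    """Return 'Sentence' for biomedical parts and 'Section' for the rest."""
    return "Sentence" if IS_BIOMEDICAL[part] else "Section"
```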
The boldface curly braces in the following examples are not part of the text, but mark the boundaries of Sentences or Sections, as described for each example.
Sentence Fragments
A big-S Sentence doesn't have to be a small-s sentence. For example, this title:
Even though it ends with a period, grammatically it's not a sentence, just a noun phrase with lots of internal structure. But we tag it as a capital-S Sentence because the tokens in it must be inside either a Sentence or a Section, and since it is the article title, it's biomedical text, so we use Sentence. It would be a Sentence even if there were no period.
Titles with More Than One Sentence
A title can contain two or more sentences, which don't all have to be grammatical sentences. Often none of them are:
Tag this title as containing two Sentences, the first one ending with the first period and the second one beginning with "Analysis".
Subheads or Labeled Lists
Inline headings such as "BACKGROUND:", "METHODS", and "CONCLUSIONS" are to be tagged as Sentences. So are numbered (or lettered) headings like "1. ... 2. ... 3.", "a) ... b) ... c)", etc., unless they are used to label phrases within a single sentence. Include any colons, periods, or other punctuation.
Subheads as Separate Sentences
- Subheads introducing whole sentences: tag as separate Sentences:
{STUDIES:} {Grandmother said fruit and vegetables were good for you.} {FINDINGS:} {She was right.}
There are four Sentences in the above text:
- STUDIES: [including the colon]
- Grandmother said fruit and vegetables were good for you.
- FINDINGS:
- She was right.
Don't try to break down a complex subhead like this one, even though it has an internal colon as well as a final one. Tag from the first uppercased word through the last, and any included punctuation as well as the final punctuation (here, a colon), as a single Sentence. So there are just two Sentences here:
- BREAST CANCER: HIGH PREVALENCE AND RISING INCIDENCE: [including both colons]
- Breast cancer is ...
{2.} {The 4-hydroxycoumarins follow similar metabolic routes and are...}
There are four Sentences:
- 1. [including the period]
- The effects ... investigated.
- 2.
- The 4-hydroxycoumarins follow...
Subheads Within a Single Sentence
Tag as part of that Sentence.
All one Sentence.
Likewise all one Sentence.
Sections in List Paragraphs
In paragraphs in a list format, where the heading and the items are separated by line breaks, the heading and each item should be a separate Section, even when there is only one item. The items may or may not be indented. So, to revisit the examples used above:
1.
{Erratum in:}
{Hum Pathol 2002 Mar;33(3):379}
2.
{Comment in:}
{Am J Surg Pathol. 2002 Mar;26(3):396, discussion 396.}
{Am J Surg Pathol. 2002 Mar;26(3):396-7; discussion 397-8.}
3.
{Publication Types:}
{Review}
{Review of Reported Cases}
In each of these Paragraphs, tag each line as a separate Section: two in example (1), three each in (2) and (3). If the content were biomedical text -- for example, a list of cell lines as part of the abstract body -- you would use Sentence rather than Section.
You may find a list Paragraph in which one or more of the list items runs to more than one physical line, as in this made-up example:
{Comment in:}
{Am J Surg Pathol. 2002 Mar;26(3):396, discussion 396.}
{Am J Surg Pathol. 2002 Mar;26(3):396-7; discussion 397-8.}
{Am J Surg Pathol. 2002 Mar;26(3):398; discussion 398-9; reply 399-400;
rebuttal 400-401; counterrebuttal 401-403; rejoinder 403; counterrejoinder
404-407; general mudslinging, accusations, and libel 407-410; court order,
410-422.}
Here the format, indentation, and content show that physical lines 4-7 of the paragraph constitute a single item in the list, and so should form a single Section. In all, there are four Sections:
- the heading, physical line 1, "Comment in:"
- the first comment and discussion, physical line 2, "Am J ... 396."
- the second comment and discussion, physical line 3, "Am J ... 397-8."
- the third comment and ensuing brawl, physical lines 4-7, "Am J ... 410-422."
Non-biomedical Material Embedded in Text
A fairly common exception to the rule of tagging everything in the abstract as Sentence is bibliographic or procedural information embedded in the text. For example:
{Text text text text.} {(ABSTRACT TRUNCATED AT 250 WORDS)}
{Text text text text.} {Copyright 2001 Wiley-Liss, Inc.}
{Text text text text.} {[T. Suzuki et al., J. Med. Chem., 42: 3001-3003, 1999]}
Here "(ABSTRACT TRUNCATED AT 250 WORDS)", "Copyright 2001 Wiley-Liss, Inc.", and "[T. Suzuki et al., J. Med. Chem., 42: 3001-3003, 1999]" are to be tagged as Sections even though they are within the abstract body Paragraph.
It can happen in the title, too:
The name of the study group may have been mistakenly attached to the title instead of being associated with the authors' names or affiliations. Since it is separated from the first part of the title by a period, we can safely make it a Section. If there were a colon instead, we would be forced to make the whole title a single Sentence.
Save
After checking and correcting the Sentence/Section tagging, save your work.
Token Tagging
Use the Bio Token tagger.
Make sure that everything in your files, including the non-biomedical parts, is included as a Token as well. Within the non-biomedical parts correct tokenization is not an issue; every non-whitespace character has to be in a token, but the boundaries between the tokens are not our concern.
We still sometimes see a serious error in WordFreak in which the tokenization looks OK in the first few paragraphs, but suddenly (usually in the abstract body) there's only a token here and there, with most of the text untokenized. A spot-check of each paragraph will detect this problem. Click in the middle of the paragraph, then use any of the keyboard arrow keys to move quickly through the next or previous few tokens. If the highlight skips any text, this problem has occurred. The best fix is to quit WordFreak without saving the file and reload your last saved version, with the corrected Sentence/Section tagging. You did save after each stage, didn't you?
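The spot-check amounts to looking for non-whitespace text that no Token covers. As a hedged illustration only (WordFreak does not expose this function; the name and the (start, end) offset representation with an exclusive end are assumptions), the check could be sketched in Python like this:

```python
from typing import List, Tuple


def untokenized_runs(text: str, tokens: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (start, end) runs of non-whitespace characters covered by no Token."""
    covered = [False] * len(text)
    for start, end in tokens:
        for i in range(start, end):
            covered[i] = True

    runs = []
    run_start = None
    for i, ch in enumerate(text):
        gap = (not ch.isspace()) and (not covered[i])
        if gap and run_start is None:
            run_start = i                  # an uncovered run begins
        elif not gap and run_start is not None:
            runs.append((run_start, i))    # the run ends before this character
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(text)))
    return runs


text = "We found a gene."
tokens = [(0, 2), (15, 16)]                # only "We" and "." were tokenized
print([text[s:e] for s, e in untokenized_runs(text, tokens)])  # ['found', 'a', 'gene']
```

Any non-empty result for a Paragraph corresponds to the highlight skipping text during the arrow-key check.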
Check In
Now your pretagging work is done. Make a ZIP file of the .ann file(s) and the source file(s) and check it in through the LAW site. WordFreak creates temporary files with the extension .ann~; these can be useful in recovering from a crash, but once you are satisfied with your annotation you can delete them. Don't include them in the check-in. (Including them wouldn't cause any harm, but it would waste space and bandwidth.)
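If you prefer to build the archive with a script, the following Python sketch shows one way to do it. It assumes that a single folder holds only the source files and their .ann files for the batch; the function name and folder layout are hypothetical, and uploading through the LAW site is still done by hand.

```python
import zipfile
from pathlib import Path
from typing import List


def make_checkin_zip(folder: Path, zip_path: Path) -> List[str]:
    """Zip every regular file in `folder` except backups ending in '~'.

    Returns the names of the files that were added, in sorted order.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.iterdir()):
            if not path.is_file() or path.name.endswith("~"):
                continue  # skip subfolders and .ann~ temporary files
            if path.resolve() == zip_path.resolve():
                continue  # skip the archive itself if it lives in `folder`
            zf.write(path, arcname=path.name)
            names.append(path.name)
    return names
```

For example, `make_checkin_zip(Path("my_batch"), Path("my_batch.zip"))` would archive a folder named `my_batch`; both names are placeholders.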
