POS and the Document
From BioIE Wiki
POS and Pretagging
Three layers of annotation — Paragraph, Sentence/Section, and Token — are added during pretagging. While the automatic tagger should put these tags in proper relation to each other, WordFreak does not enforce this, so it's important to check.
Nesting of Units
Every non-whitespace character in the file should be in a Token, every Token should be in a Sentence or a Section, and every Sentence or Section should be in a Paragraph. (If you think of the larger units as the parents of the next-smaller ones, there shouldn't be any orphans.) There can be no overlapping: the "child" must be completely contained within the "parent".
Within the non-biomedical text (journal name, author name(s), research institution name, PMID number), correct tokenization is not an issue; that is, every non-whitespace character has to be in a token, but the boundaries between the tokens are not our concern. This is discussed in more detail in the Pretagging Guide under Biomedical Text and Otherwise and Non-biomedical Material Embedded in Text.
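The nesting rules above lend themselves to a mechanical check. The following is a minimal sketch, not part of WordFreak or of the project's tools: it assumes each annotation is available as a (start, end) character-offset pair, and the function names and the sample sentence are made up for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def contains(parent: Span, child: Span) -> bool:
    """True if the child span lies completely inside the parent span."""
    return parent[0] <= child[0] and child[1] <= parent[1]


def find_orphans(children: List[Span], parents: List[Span]) -> List[Span]:
    """Return the children that are not fully contained in any parent."""
    return [c for c in children if not any(contains(p, c) for p in parents)]


def uncovered_characters(text: str, tokens: List[Span]) -> List[int]:
    """Return offsets of non-whitespace characters that belong to no token."""
    covered = set()
    for start, end in tokens:
        covered.update(range(start, end))
    return [i for i, ch in enumerate(text) if not ch.isspace() and i not in covered]


# Made-up sample; the final "." is deliberately left without a token.
text = "CYP3A4 is induced."
paragraphs = [(0, 18)]
sentences = [(0, 18)]
tokens = [(0, 6), (7, 9), (10, 17)]

print(find_orphans(tokens, sentences))      # [] : every token is inside a sentence
print(find_orphans(sentences, paragraphs))  # [] : every sentence is inside a paragraph
print(uncovered_characters(text, tokens))   # [17] : the untokenized "."
```

A span that only partly overlaps its parent is reported as an orphan too, which matches the rule that the "child" must be completely contained within the "parent".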
Parts of Speech and Tokens
Part of speech tags are applied to tokens. Within WordFreak and this project, every token is exactly one POS and every POS is exactly one token. We use separate programs to divide the text into tokens and to apply POS tags to those tokens before the POS annotators review them. In WordFreak POS annotation, a token that has not received a POS tag or that has lost its POS tag is tagged as "token".
Entity Boundaries
An entity may contain any number of tokens, but a token must not contain more than one entity. This rule is different from the other nesting rules because not all the text is contained in entities; indeed, most of it is not. WordFreak will not let you specify a token that crosses a sentence or section boundary, but it does not yet enforce this rule for entities, so it must be enforced by annotators. Do not create a token whose span crosses either edge of an entity. If a POS or token tag is marked as being assigned by an annotator rather than a tagger, as shown in the status line at the bottom of the main WordFreak window, annotators should look at the entity annotation (oncology or CYP450 in the WordFreak menu) before changing it.
Chained Entities
This may come up in several kinds of situation, but chaining is the most likely cause of such a problem, as in the following (nonsense) example:
| AFX | HYPH | | CC | | AFX | JJ | | NNS |
| stereo | - | ♦ | and | ♦ | iso | metric | ♦ | alleles |
| entity | chain | |||||||
| entity chain | ||||||||
In tagging this string, the entity annotators would have had to split "isometric" into two tokens in order to tag "stereometric alleles" as a chain of two links: "stereo (+) metric alleles". POS annotation must respect those token splits. Don't leave the text with an entity boundary inside a token, e.g., "isometric" must not be a token enclosing the left boundary of "metric alleles".
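The rule that no token may cross an entity edge can be checked with a few lines of code. This is a sketch under assumptions, not a WordFreak feature: tokens and entity links are taken to be (start, end) character offsets with an exclusive end, and the function name is hypothetical. The sample string is the nonsense example above.

```python
def tokens_crossing_entities(tokens, entities):
    """Return tokens that have an entity edge strictly inside their span."""
    offending = []
    for t_start, t_end in tokens:
        for e_start, e_end in entities:
            # An edge exactly at a token boundary is fine; one inside the token is not.
            if t_start < e_start < t_end or t_start < e_end < t_end:
                offending.append((t_start, t_end))
                break
    return offending


text = "stereo- and isometric alleles"
# Links of the chain "stereo (+) metric alleles", as character offsets.
entity_links = [(0, 6), (15, 29)]

# "isometric" kept as one token: it encloses the left edge of "metric alleles".
unsplit = [(0, 6), (6, 7), (8, 11), (12, 21), (22, 29)]
# "isometric" split into "iso" and "metric", as the entity annotation requires.
split = [(0, 6), (6, 7), (8, 11), (12, 15), (15, 21), (22, 29)]

for start, end in tokens_crossing_entities(unsplit, entity_links):
    print("token crosses an entity edge:", text[start:end])
print(tokens_crossing_entities(split, entity_links))
```

With the unsplit tokenization the check reports "isometric"; with the split tokenization it reports nothing.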
Parentheses in Entities
Q: I wanted to tag this as LRB NN RRB, where the NN in that sandwich is the AUC0-24(P). I looked at the CYP annotation just to make sure, and the entity annotator had tagged everything but the last square bracket as a single entity! Can that be right?
[AUC0-24(P)]
A: No. Go into the entity annotation for CYP 450, select each of those entities, and use the shrink-left button in the chooser window to pull the entity's left boundary off the left bracket. Then POS-annotate those strings of text.
In general, assume that parentheses and brackets are supposed to be balanced. Also assume that an entity can be contained in parentheses or brackets, but should not include matching parentheses or brackets at its beginning and end. (But be sure to check for chained entities.)
In the following examples, each string is what the entity annotator has marked as a single entity, unless noted otherwise. Everything said here about parentheses also applies to [square brackets], which we have seen in the biomedical texts, and potentially also to {curly braces} and <angle brackets>, which we have not seen yet.
The following are reasonable entity forms:
- Example 1: (abcd), where only "abcd" is marked as the entity
- Example 2: abc(def)ghi
- Example 3: (a)bcdef
- Example 4: abc(d)
and even
- Example 5: (ab)cde(fgh)
The following are not:
- Example 6: (abcd), where the parentheses are marked as part of the entity
- Example 7: (abcdef
- Example 8: abcdef)
- Example 9: (ab)cdef)
- Example 10: (abcd(fg)
In Example 1, the parentheses contain the entity string but are not part of it, which is correct. In Example 2, they are entirely contained within the expression, so annotators would ignore them. In Examples 3 and 4, the beginning or ending parenthesis is balanced by a matching parenthesis within the expression, so it has to be included to balance its mate.
Example 5 looks as though it includes enclosing parentheses, but each of those marginal parentheses is balanced by a matching parenthesis within the expression. Since the "(" at the left end of the string matches a ")" embedded in the string, they are both part of the entity, and similarly for the ")" at the right end of the string.
In Example 6 the entity string contains balanced parentheses, but they are at its boundaries. This should be tagged as in Example 1.
Examples 7-10 have unbalanced parentheses and should be assumed to be incorrect as entities. In each of these cases, presumably, the unmatched marginal parenthesis belongs to the sentence, or possibly to a larger piece of fruit salad that this entity is embedded in. If an annotator sees something like these, it should be corrected as described above and reported to the list, with the text in question and the file's source ID and PMID.
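The distinctions drawn in these examples can be sketched in code. The helper below is hypothetical and only a rough aid: it takes the string an entity annotator has marked and reports whether it is unbalanced, enclosed in a matching outer pair, or acceptable. It handles parentheses, square brackets and curly braces; angle brackets are left out on the assumption that "<" and ">" also occur as comparison signs. It knows nothing about chaining, so an "unbalanced" result is a prompt to look at the entity annotation, not a verdict.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def match_positions(s):
    """Map the index of each opener to the index of its closer, or None if unbalanced."""
    stack = []
    matches = {}
    for i, ch in enumerate(s):
        if ch in PAIRS:
            stack.append(i)
        elif ch in CLOSERS:
            if not stack or PAIRS[s[stack[-1]]] != ch:
                return None  # closer with no matching opener
            matches[stack.pop()] = i
    return None if stack else matches  # leftover openers mean unbalanced


def classify_entity(s):
    """Return 'unbalanced', 'enclosed' (outer pair wraps the whole string), or 'ok'."""
    matches = match_positions(s)
    if matches is None:
        return "unbalanced"
    if s and matches.get(0) == len(s) - 1:
        return "enclosed"
    return "ok"


for entity in ["abcd", "abc(def)ghi", "(a)bcdef", "abc(d)", "(ab)cde(fgh)",
               "(abcd)", "(abcdef", "abcdef)", "(ab)cdef)", "(abcd(fg)",
               "[AUC0-24(P)"]:
    print(entity, "->", classify_entity(entity))
```

Only the "ok" strings correspond to reasonable entity forms; "enclosed" corresponds to Example 6 and "unbalanced" to Examples 7-10 and to the bracketed string from the question above.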
Parentheses in Chained Entities
Before correcting an unbalanced parenthesis in an entity annotation, be sure to check for chains. When an entity annotator uses chaining on a split coordination, individual links may include unmatched parentheses that are balanced, and therefore legitimate, within the entity string as a whole. For example:
| NN | CD | | CC | | CD | NN | |
| 1-(1-Benzofuran- | 2 | ♦ | and | ♦ | 3 | -yl)-2-mesitylethanone | |
| entity | chain | 1-(1-Benzofuran- (+) 3 (+) -yl)-2-mesitylethanone | |||||
| chain | entity | 1-(1-Benzofuran- (+) 2 (+) -yl)-2-mesitylethanone | |||||
Such a situation may be apparent from the text, as in this made-up example (although at least chain 1 is a real chemical name), or it may not show up until you look at the entity annotation view. (Since each of these links is itself fruit salad, you would tag each of them as NN, not AFX.)
False Break
Once in a while, a line or space break will show up in files. Sometimes a very long chemical name is longer than the line length of the text, and a line break comes into the middle of it -- equivalent to a space. You can usually recognize this, but DON'T try to correct it or to tag the parts as a single piece; you might guess wrong. Tag them both as NN, and add the comment "false break?" to both. For example,
(6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H-
should be tagged as:
(_-LRB 6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H-_NN(false break?)
False line breaks can also occur in regular text. These are like the false breaks we've seen in very long chemical terms, but in paragraphs instead: sometimes a blank line gets embedded in the middle of a paragraph of continuous text,
as I just did here on purpose. In this case, include the empty line in the paragraph tag, and make a "false break" comment in the Chooser window.
Ungrammatical Expressions and Typos
These can sometimes be due to translation or to text written by non-native English speakers. Tag them as they appear in the text, then make a comment in the Chooser window. For example:
Tag "know" as VB (even though it should be "known"_VBN) and make a comment.
Abbreviations
Abbreviations and initials should be tagged as if they were spelled out (see Santorini, section 5.4, p. 32), and as FW if the spelled-out form is not English. For example:
- e.g. ("exempli gratia"= "for example")
- e. g. [with a space between e and g]
- i.e. ("id est" = "that is")
- i. e. [with a space between i and e]
- s.d. ("standard deviation"): "standard deviation" would be tagged JJ NN, but since there is no space we can only assign a single POS tag. We tag it as the POS appropriate to the head of the phrase: the word that the rest of the phrase modifies, the word whose function in the sentence dictates the function of the entire phrase. Here the head is "deviation", so "s.d." is tagged NN.
Note that many of these can also occur without periods:
Abbreviations with Variable POS
Some biomedical abbreviations can stand for either the adjectival or the adverbial form of the word:
i.v. ("intravenous(ly)")
i.m. ("intramuscular(ly)")
(These can also occur without periods: "When hCG (5 IU) was administered sc and the follicles were isolated 3 h later...".)
To tag these correctly you must look at the context:
In the last of these examples you can't use the usually reliable technique of reading it out loud to hear which sounds right, the adjective or the adverb, because the syntax is notational rather than English. The text in parentheses, though, modifies the substance being injected, not the act of injection -- compare "as chloride", "in saline solution" -- so we use JJ there.
Unit Abbreviation Attached to Number
Sometimes in a measurement the symbol for the unit is attached directly to the number of units. Break these up, tagging the number as CD and the unit symbol as NN (or possibly NNS):
72hr (meant to be read as "72 hours"): 72_CD hr_NN
72hrs: 72_CD hrs_NNS
0.5(-6)M (= "0.5 x 10^-6 Mol"): 0.5(-6)_CD M_NN
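The split described above can be sketched as follows. This is an illustration only, not the project's tokenizer: the function name is hypothetical, the two unit lists are tiny made-up samples, and notational forms such as 0.5(-6)M are deliberately not handled.

```python
import re

# Hypothetical sample lists; a real list of unit symbols would be much longer.
SINGULAR_UNITS = {"hr", "h", "min", "mg", "ml", "mm", "nM", "kDa", "M"}
PLURAL_UNITS = {"hrs", "mins", "lbs"}

# A plain or decimal number followed directly by letters, with nothing else in the token.
NUMBER_UNIT = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]+)$")


def split_number_unit(token):
    """Return [(number, 'CD'), (unit, 'NN' or 'NNS')], or None if the token does not fit."""
    m = NUMBER_UNIT.match(token)
    if not m:
        return None
    number, unit = m.groups()
    if unit in PLURAL_UNITS:
        return [(number, "CD"), (unit, "NNS")]
    if unit in SINGULAR_UNITS:
        return [(number, "CD"), (unit, "NN")]
    return None  # unknown suffix: leave the token for the annotator


print(split_number_unit("72hr"))   # [('72', 'CD'), ('hr', 'NN')]
print(split_number_unit("72hrs"))  # [('72', 'CD'), ('hrs', 'NNS')]
print(split_number_unit("P450"))   # None
```

Returning None for anything outside the known lists keeps the sketch from splitting strings that merely look like measurements.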
Singular vs. Plural, and Plurals with "'s"
All abbreviations for measurements, such as "mm" (millimeter(s)), "nM" (nanomolar), and "kDa" (kilodalton(s)), are singular (NN), except for the few like "ins." or "lbs." that are explicitly plural nouns (NNS).
We have seen a few instances of abbreviations pluralized with 's:
KM's
Ki's
P-450's
ED50's
Tag the entire string as an NNS. If there are multiword entity names pluralized in this way, break at white space as we have been doing all along.
Parentheses in Abbreviations
Many terms referring to measured or calculated values are symbolized with abbreviations that include parentheses. Tag such symbols as NN; do not split them. Of course, if the entity annotators have tagged an entity within such a symbol you will have to split it.
The list includes, but is not limited to:
IC(50)
K(I)
K(i)
k(inact)
k(inactivation)
K(M)
K(m)
V(max)
You will see these much more often in the CYP files than in the oncology files, because they are part of what the CYP researchers are looking for. Many of them also occur without the parentheses. The ones on the list above should not include tagged entities, but other similar symbols may do so, for example, a symbol referring to the concentration of a particular compound.
Biomedical Conventions
Amino Acid Substitutions
There is a standard format for representing amino acid and nucleotide substitutions, consisting of either one letter or three letters, then one or more digits, then again either one or three letters (the same number as in the first part). The three-letter amino acid symbols are usually, but not always, cased as capital, small, small.
- Three-letter amino acid notation:
- One-letter amino acid notation:
- One-letter nucleotide (base pair) notation:
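The format described above can be written as a pattern. The sketch below is hypothetical and checks only the shape of the string (one or three letters, digits, then the same number of letters); it does not verify that the letters are valid amino acid or nucleotide symbols, and the sample strings are illustrative, not taken from the annotated files.

```python
import re

# Either letter-digits-letter or three letters-digits-three letters.
SUBSTITUTION = re.compile(
    r"^(?:([A-Za-z])(\d+)([A-Za-z])|([A-Za-z]{3})(\d+)([A-Za-z]{3}))$"
)


def split_substitution(s):
    """Return (letters, digits, letters) if s has the substitution shape, else None."""
    m = SUBSTITUTION.match(s)
    if not m:
        return None
    # Only one of the two alternatives matched; drop the unused groups.
    return tuple(g for g in m.groups() if g is not None)


print(split_substitution("V600E"))      # ('V', '600', 'E')
print(split_substitution("Val600Glu"))  # ('Val', '600', 'Glu')
print(split_substitution("Val600E"))    # None
```

The three returned pieces correspond to the letter sections and the number section into which such strings are divided.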
The oncology entity taggers have split these up into letter sections and number sections. POS annotators tag them as follows:
Amino Acid Symbols
The following list includes the twenty standard amino acids and some other amino acids and symbols.
| Full name | 3-letter code | 1-letter code |
| alanine | Ala | A |
| arginine | Arg | R |
| asparagine | Asn | N |
| aspartic acid | Asp | D |
| cysteine | Cys | C |
| gamma-carboxyglutamate | Gla | |
| glutamate or glutamine | Glx | |
| glutamine | Gln | Q |
| glutamic acid | Glu | E |
| glycine | Gly | G |
| histidine | His | H |
| homoserine | Hse | |
| hydroxylysine | Hyl | |
| hydroxyproline | Hyp | |
| isoleucine | Ile | I |
| leucine | Leu | L |
| lysine | Lys | K |
| methionine | Met | M |
| ornithine | Orn | |
| phenylalanine | Phe | F |
| proline | Pro | P |
| pyroglutamic acid | Pyr | |
| sarcosine | Sar | |
| serine | Ser | S |
| threonine | Thr | T |
| tryptophan | Trp | W |
| tyrosine | Tyr | Y |
| valine | Val | V |
| (unspecified amino acid) | Xaa | |
| any | *** | |
| gap of indeterminate length | --- | |
| translation stop | TGA | |
| translation stop | TAG | |
| translation stop | TAA | |
Nucleotide symbols
- A -- adenine
- G -- guanine
- C -- cytosine
- T -- thymine (DNA only)
- U -- uracil (RNA only)
Strings of nucleotides
Often we see strings of the letters G, C, A, and T, representing strings of nucleotides on one strand of DNA (or RNA, with U instead of T): "AGTTCA". Don't split these up. The tokenizer will usually make one token of it; leave it that way and tag the whole string as a single noun.
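A rough way to spot such strings is sketched below. It is an assumption-laden illustration, not part of the tokenizer: the function name and the minimum length of four letters are made up, and ordinary capitalized strings that happen to use only these letters (for example "GATA" or "TATA") will also match, so a hit is a prompt to look at the token, not a decision.

```python
import re

# All-DNA letters or all-RNA letters; T and U are not allowed to mix.
NUCLEOTIDE_STRING = re.compile(r"(?:[ACGT]+|[ACGU]+)")


def looks_like_nucleotide_string(token, min_length=4):
    """True if the token consists only of nucleotide letters and is long enough."""
    return len(token) >= min_length and NUCLEOTIDE_STRING.fullmatch(token) is not None


for token in ["AGTTCA", "AGUUCA", "AGTUCA", "CAT", "alleles"]:
    print(token, looks_like_nucleotide_string(token))
```

The check is case-sensitive on purpose, since these strings appear in capitals in the example above.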
Complex Chemical Words (i.e., "fruit salad")
A chemical term with internal punctuation, such as "2,3,7,8-tetrachlorodibenzo-p-dioxin", should be tagged as a single token.
Fruit Salad and POS
"Fruit salad" is our nickname for chemical terms like this that include strings that aren't normally found in words such as numbers, commas, hyphens, parentheses, and Greek letter names. Fruit salad is usually an NN, but other parts of speech can also be "fruit salad":
The -ic ending makes the word an adjective.
Since both of the parts of this hyphenate are extremely technical words, we should consider it fruit salad and not break it up. How should we tag it? According to Santorini (pp. 15-17) and our own adaptation of her JJ or VBG/VBN rules, we would call this JJ, because there is no evidence of a verb "to 2[N]-methylate". But we do have indisputable VBNs of the same form, such as
was N-demethylated by
being O-demethylated to
Therefore, we have to conclude that like so many other parts of this technical vocabulary, "2[N]-methylate" is a verb constructed according to productive rules, and "2[N]-methylated" in this sentence is its VBN:
Fruit Salad and Tokenization
Do adjust the tokenization at the ends, if necessary. For example:
Here, the parentheses clearly mark off a synonym for the abbreviation. If the '(' has been tokenized together with the '2', you have to separate them. The digit is part of the chemical name but the parenthesis is not. Tag as:
Non-substance Fruit Salad
Fruit salad is not restricted to the names of substances. For example, a biochemical term referring to a process can also be fruit salad. See the NN or NNS Guidelines.
