POS and the Document

From BioIE Wiki


POS and Pretagging

Three layers of annotation — Paragraph, Sentence/Section, and Token — are added during pretagging. While the automatic tagger should put these tags in proper relation to each other, WordFreak does not enforce this, so it's important to check.

Nesting of Units

Every non-whitespace character in the file should be in a Token, every Token should be in a Sentence or a Section, and every Sentence or Section should be in a Paragraph. (If you think of the larger units as the parents of the next-smaller ones, there shouldn't be any orphans.) There can be no overlapping: the "child" must be completely contained within the "parent".
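The nesting rule above can be checked mechanically once the annotation spans are available as character offsets. The following is a minimal sketch under that assumption; the span representation and the function names `orphans` and `untokenized` are hypothetical and are not part of WordFreak.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def orphans(children: List[Span], parents: List[Span]) -> List[Span]:
    """Return the children that are not completely contained in any parent."""
    return [
        (cs, ce)
        for cs, ce in children
        if not any(ps <= cs and ce <= pe for ps, pe in parents)
    ]


def untokenized(text: str, tokens: List[Span]) -> List[int]:
    """Return offsets of non-whitespace characters that lie outside every token."""
    covered = set()
    for start, end in tokens:
        covered.update(range(start, end))
    return [i for i, ch in enumerate(text) if not ch.isspace() and i not in covered]


text = "P450 binds."
tokens = [(0, 4), (5, 10)]   # the final "." is deliberately left untokenized
sentences = [(0, 11)]
paragraphs = [(0, 11)]

print(untokenized(text, tokens))        # characters missing from every token
print(orphans(tokens, sentences))       # tokens outside every sentence
print(orphans(sentences, paragraphs))   # sentences outside every paragraph
```

The same `orphans` call serves each level of the hierarchy, since "completely contained" is the only relation the rule needs; a child that merely overlaps a parent is reported as an orphan.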

Within the non-biomedical text (journal name, author name(s), research institution name, PMID number), correct tokenization is not an issue; that is, every non-whitespace character has to be in a token, but the boundaries between the tokens are not our concern. This is discussed in more detail in the Pretagging Guide under Biomedical Text and Otherwise and Non-biomedical Material Embedded in Text.

Parts of Speech and Tokens

Part of speech tags are applied to tokens. Within WordFreak and this project, every token is exactly one POS and every POS is exactly one token. We use separate programs to divide the text into tokens and to apply POS tags to those tokens before the POS annotators review them. In WordFreak POS annotation, a token that has not received a POS tag or that has lost its POS tag is tagged as "token".

Entity Boundaries

An entity may contain any number of tokens, but a token must not contain more than one entity. This rule is different from the other nesting rules because not all the text is contained in entities; indeed, most of it is not. WordFreak will not let you specify a token that crosses a sentence or section boundary, but it does not yet enforce this rule for entities, so it must be enforced by annotators. Do not create a token whose span crosses either edge of an entity. If a POS or token tag is marked as being assigned by an annotator rather than a tagger, as shown in the status line at the bottom of the main WordFreak window, annotators should look at the entity annotation (oncology or CYP450 in the WordFreak menu) before changing it.
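Since WordFreak does not enforce this rule, a check along the following lines could flag offending tokens for review. This is only a sketch: it assumes tokens and entities are available as (start, end) character offsets, and the function name `crosses_entity_edge` is made up for illustration.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def crosses_entity_edge(token: Span, entity: Span) -> bool:
    """True if either edge of the entity falls strictly inside the token."""
    t_start, t_end = token
    e_start, e_end = entity
    return t_start < e_start < t_end or t_start < e_end < t_end


text = "stereo- and isometric alleles"
iso_start = text.index("isometric")
metric_start = text.index("metric")
link = (metric_start, len(text))  # an entity link covering "metric alleles"

unsplit = (iso_start, iso_start + len("isometric"))
split = [(iso_start, metric_start), (metric_start, metric_start + len("metric"))]

print(crosses_entity_edge(unsplit, link))                  # "isometric" as one token
print([crosses_entity_edge(tok, link) for tok in split])   # "iso" and "metric"
```

A token that shares an edge with the entity is not flagged; only a token that encloses an entity edge is.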

Chained Entities

This may come up in several kinds of situation, but chaining is the most likely cause of such a problem, as in the following (nonsense) example:

AFX     HYPH  CC   AFX  JJ      NNS
stereo  -     and  iso  metric  alleles
entity chain
entity chain


In tagging this string, the entity annotators would have had to split "isometric" into two tokens in order to tag "stereometric alleles" as a chain of two links: "stereo (+) metric alleles". POS annotation must respect those token splits. Don't leave the text with an entity boundary inside a token, e.g., "isometric" must not be a token enclosing the left boundary of "metric alleles".

Parentheses in Entities

Q: I wanted to tag this as LRB NN RRB, where the NN in that sandwich is the AUC0-24(P). I looked at the CYP annotation just to make sure, and the entity annotator had tagged everything but the last square bracket as a single entity! Can that be right?

[AUC0-24(P)]


A: No. Go into the entity annotation for CYP 450, select each of those entities, and use the shrink-left button in the chooser window to pull the entity's left boundary off the left bracket. Then POS-annotate those strings of text.

In general, assume that parentheses and brackets are supposed to be balanced. Also assume that an entity can be contained in parentheses or brackets, but should not include matching parentheses or brackets at its beginning and end. (But be sure to check for chained entities.)

In the following examples, the string that the entity annotator has marked as a single entity is listed beside each example. Everything said here about parentheses also applies to [square brackets], which we have seen in the biomedical texts, and potentially also to {curly braces} and <angle brackets>, which we have not seen yet.

The following are reasonable entity forms:

  1. (abcd)          entity: abcd
  2. abc(def)ghi     entity: abc(def)ghi
  3. (a)bcdef        entity: (a)bcdef
  4. abc(d)          entity: abc(d)

and even

  5. (ab)cde(fgh)    entity: (ab)cde(fgh)

but not this:

  6. (abcd)          entity: (abcd)

or any of these (all with unbalanced parentheses):

  7. (abcdef         entity: (abcdef
  8. abcdef)         entity: abcdef)
  9. (ab)cdef)       entity: (ab)cdef)
  10. (abcd(fg)      entity: (abcd(fg)

In Example 1, the parentheses contain the entity string but are not part of it, which is correct. In Example 2, they are entirely contained within the expression, so annotators would ignore them. In Examples 3 and 4, the beginning or ending parenthesis is balanced by a matching parenthesis within the expression, so it has to be included to balance its mate.

Example 5 looks as though it includes enclosing parentheses, but each of those marginal parentheses is balanced by a matching parenthesis within the expression. Since the "(" at the left end of the string matches a ")" embedded in the string, they are both part of the entity, and similarly for the ")" at the right end of the string.

In Example 6 the entity string contains balanced parentheses, but they are at its boundaries. This should be tagged as in Example 1.

Examples 7-10 have unbalanced parentheses and should be assumed to be incorrect as entities. In each of these cases, presumably, the unmatched marginal parenthesis belongs to the sentence, or possibly to a larger piece of fruit salad that this entity is embedded in. If an annotator sees something like these, it should be corrected as described above and reported to the list, with the text in question and the file's source ID and PMID.
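The three cases discussed above (acceptable, wrapped in a matching pair at both edges, and unbalanced) can be told apart with a simple bracket stack. The sketch below is an illustration only: the function name `classify_entity` and its labels are hypothetical, angle brackets are left out on the assumption that "<" and ">" also occur as comparison signs, and a link of a chained entity may legitimately be reported as unbalanced, so the result is a prompt for review rather than a verdict.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
OPENER_OF = {close: open_ for open_, close in PAIRS.items()}


def classify_entity(entity: str) -> str:
    """Return 'unbalanced', 'enclosed', or 'ok' for an entity string.

    'enclosed' means the first character is an opening bracket whose mate
    is the last character, so the pair should not be part of the entity.
    """
    stack = []              # (opening bracket, index) pairs still open
    mate_of_first = None    # index of the bracket that closes position 0
    for i, ch in enumerate(entity):
        if ch in PAIRS:
            stack.append((ch, i))
        elif ch in OPENER_OF:
            if not stack or stack[-1][0] != OPENER_OF[ch]:
                return "unbalanced"
            _, open_index = stack.pop()
            if open_index == 0:
                mate_of_first = i
    if stack:
        return "unbalanced"
    if entity and mate_of_first == len(entity) - 1:
        return "enclosed"
    return "ok"


for example in ["abc(def)ghi", "(a)bcdef", "(ab)cde(fgh)", "(abcd)", "(abcdef", "[AUC0-24(P)"]:
    print(example, classify_entity(example))
```

Tracking the mate of the first character is what separates "(ab)cde(fgh)", where the edge parentheses belong to different pairs, from "(abcd)", where they form one enclosing pair.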

Parentheses in Chained Entities

Before correcting an unbalanced parenthesis in an entity annotation, be sure to check for chains. When an entity annotator uses chaining on a split coordination, individual links may include unmatched parentheses that are balanced, and therefore legitimate, within the entity string as a whole. For example:

NN                CD  CC   CD  NN
1-(1-Benzofuran-  2   and  3   -yl)-2-mesitylethanone
entity chain 1-(1-Benzofuran- (+) 3 (+) -yl)-2-mesitylethanone
chain entity 1-(1-Benzofuran- (+) 2 (+) -yl)-2-mesitylethanone

Such a situation may be apparent from the text, as in this made-up example (although at least chain 1 is a real chemical name), or it may not show up until you look at the entity annotation view. (Since each of these links is itself fruit salad, you would tag each of them as NN, not AFX.)

False Break

Once in a while, a line or space break will show up in files. Sometimes a very long chemical name is longer than the line length of the text, and a line break comes into the middle of it -- equivalent to a space. You can usually recognize this, but DON'T try to correct it or to tag the parts as a single piece; you might guess wrong. Tag them both as NN, and add the comment "false break?" to both. For example,

a new nonsteroidal aromatase inhibitor, R 76 713

(6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H-

benzotriazole)

should be tagged as:

R_NN 76_CD 713_CD

(_-LRB- 6-[(4-chlorphenyl)(1H-1,2,4-triazol-1-yl)-methyl]-1-methyl-1H-_NN(false break?)

benzotriazole_NN(false break?) )_-RRB-


False line breaks can also occur in regular text. These are like the false breaks we've seen in very long chemical terms, but in paragraphs instead: sometimes a blank line gets embedded in the middle of a paragraph of continuous text,


as I just did here on purpose. In this case, include the empty line in the paragraph tag, and make a "false break" comment in the Chooser window.

Ungrammatical Expressions and Typos

These can sometimes be due to translation or to text written by non-native English speakers. Tag them as they appear in the text, then make a comment in the Chooser window. For example:

Vinyl chloride (VC) is a know animal and human carcinogen associated with liver angiosarcomas.

Tag "know" as VB (even though it should be "known"_VBN) and make a comment.

Abbreviations

Abbreviations and initials should be tagged as if they were spelled out (see Santorini, section 5.4, p. 32); abbreviations of non-English expressions are tagged FW. For example:

  • e.g. ("exempli gratia" = "for example")
e.g._FW


  • e. g. [with a space between e and g]
e._FW g._FW


  • i.e. ("id est" = "that is")
i.e._FW


  • i. e. [with a space between i and e]
i._FW e._FW


  • s.d. ("standard deviation"): "standard deviation" would be tagged JJ NN, but since there is no space we can only assign a single POS tag. We tag it as the POS appropriate to the head of the phrase: the word that the rest of the phrase modifies, the word whose function in the sentence dictates the function of the entire phrase: here, "deviation".
s.d._NN


Note that many of these can also occur without periods:

a mean +/- SD of 54.2 +/- 29.2 pmol/min/mg


Abbreviations with Variable POS

Some biomedical abbreviations can stand for either the adjectival or the adverbial form of the word:

s.c. ("subcutaneous(ly)")

i.v. ("intravenous(ly)")

i.m. ("intramuscular(ly)")

i.p. ("intraperitoneal(ly)")

(These can also occur without periods: "When hCG (5 IU) was administered sc and the follicles were isolated 3 h later...".)


To tag these correctly you must look at the context:

s.c._JJ injections

4.0 mg injected sc_RB

were injected with XYZ (ip_JJ 2.5 mg)

In the last of these examples you can't use the usually reliable technique of reading it out loud to hear which sounds right, the adjective or the adverb, because the syntax is notational rather than English. The text in parentheses, though, modifies the substance being injected, not the act of injection -- compare "as chloride", "in saline solution" -- so we use JJ there.

Unit Abbreviation Attached to Number

Sometimes in a measurement the symbol for the unit is attached directly to the number of units. Break these up, tagging the number as CD and the unit symbol as NN (or possibly NNS):

72hr  (meant to be read as "72 hours")
72_CD hr_NN

72hrs
72_CD hrs_NNS

0.5(-6)M  (= "0.5 x 10^-6 Mol")
0.5(-6)_CD M_NN
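For the simple cases, the split can be proposed automatically. The sketch below is a rough illustration, not the project's tokenizer: the function name `split_measurement` and the short `PLURAL_UNITS` set are assumptions, and the pattern only proposes candidates. Strings that merely look like a number plus letters (for example a "1H" inside a chemical name) would also match, so the annotator still has to confirm each one.

```python
import re

# Hypothetical, deliberately incomplete set of explicitly plural unit symbols.
PLURAL_UNITS = {"hrs", "lbs.", "ins."}

# A plain decimal number followed directly by letters (optionally a final period).
_NUMBER_UNIT = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]+\.?)$")


def split_measurement(token: str):
    """Split e.g. '72hr' into tagged pieces, or return None if the shape differs."""
    match = _NUMBER_UNIT.match(token)
    if match is None:
        return None
    number, unit = match.groups()
    unit_tag = "NNS" if unit in PLURAL_UNITS else "NN"
    return [(number, "CD"), (unit, unit_tag)]


for token in ["72hr", "72hrs", "0.5mg", "AUC0-24"]:
    print(token, split_measurement(token))
```

Notational forms such as "0.5(-6)M" do not fit the pattern and return None, which leaves them for manual tagging.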

Singular vs. Plural, and Plurals with "'s"

All abbreviations for measurements, such as "mm" (millimeter(s)), "nM" (nanomole(s)), and "kDa" (kilodalton(s)), are singular (NN), except for the few like "ins." or "lbs." that are explicitly plural nouns (NNS).

We have seen a few instances of abbreviations pluralized with 's:

Vmax'S [sic]

KM's

Ki's

P-450's

ED50's

Tag the entire string as an NNS. If there are multiword entity names pluralized in this way, break at white space as we have been doing all along.

Parentheses in Abbreviations

Many terms referring to measured or calculated values are symbolized with abbreviations that include parentheses. Tag such symbols as NN; do not split them. Of course, if the entity annotators have tagged an entity within such a symbol you will have to split it.

The list includes, but is not limited to:

EC(50)

IC(50)

K(I)

K(i)

k(inact)

k(inactivation)

K(M)

K(m)

V(max)

You will see these much more often in the CYP files than in the oncology files, because they are part of what the CYP researchers are looking for. Many of them also occur without the parentheses. The ones on the list above should not include tagged entities, but other similar symbols may do so, for example, a symbol referring to the concentration of a particular compound.

Biomedical Conventions

Amino Acid Substitutions

There is a standard format for representing amino acid and nucleotide substitutions, consisting of either one letter or three letters, then one or more digits, then again either one or three letters (the same number as in the first part). The three-letter amino acid symbols are usually, but not always, cased as capital, small, small.

  • Three-letter amino acid notation:
Ser726Pro


  • One-letter amino acid notation:
S276P


  • One-letter nucleotide (base pair) notation:
G35A


The oncology entity taggers have split these up into letter sections and number sections. POS annotators tag them as follows:

Ser726Pro

Ser_NN 726_CD Pro_NN

S276P

S_NN 276_CD P_NN

G35A

G_NN 35_CD A_NN
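The format described above (one or three letters, digits, then the same number of letters) is regular enough to express as a pattern. The following is a minimal sketch; the function name `split_substitution` is hypothetical, and the pattern checks only the shape of the string, not whether the letters are valid amino acid or nucleotide symbols, so a string such as "H2O" would also match.

```python
import re

# Either three letters + digits + three letters, or one capital + digits + one capital.
_SUBSTITUTION = re.compile(
    r"^(?:([A-Za-z]{3})(\d+)([A-Za-z]{3})|([A-Z])(\d+)([A-Z]))$"
)


def split_substitution(token: str):
    """Return [(text, tag), ...] for a substitution, or None if the shape differs."""
    match = _SUBSTITUTION.match(token)
    if match is None:
        return None
    if match.group(1) is not None:
        before, position, after = match.group(1, 2, 3)
    else:
        before, position, after = match.group(4, 5, 6)
    return [(before, "NN"), (position, "CD"), (after, "NN")]


for token in ["Ser726Pro", "S276P", "G35A", "Ser726P", "P450"]:
    print(token, split_substitution(token))
```

Writing the two alternatives separately is what enforces "the same number as in the first part": a mixed form such as "Ser726P" matches neither branch.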

Amino Acid Symbols

The following list includes the twenty standard amino acids and some other amino acids and symbols.

Full name                    3-letter code  1-letter code

alanine                      Ala            A
arginine                     Arg            R
asparagine                   Asn            N
aspartic acid                Asp            D
cysteine                     Cys            C
gamma-carboxyglutamate       Gla
glutamate or glutamine       Glx
glutamine                    Gln            Q
glutamic acid                Glu            E
glycine                      Gly            G
histidine                    His            H
homoserine                   Hse
hydroxylysine                Hyl
hydroxyproline               Hyp
isoleucine                   Ile            I
leucine                      Leu            L
lysine                       Lys            K
methionine                   Met            M
ornithine                    Orn
phenylalanine                Phe            F
proline                      Pro            P
pyroglutamic acid            Pyr
sarcosine                    Sar
serine                       Ser            S
threonine                    Thr            T
tryptophan                   Trp            W
tyrosine                     Tyr            Y
valine                       Val            V
(unspecified amino acid)     Xaa
any                          ***
gap of indeterminate length  ---
translation stop             TGA
translation stop             TAG
translation stop             TAA

Nucleotide symbols

A -- adenine
G -- guanine
C -- cytosine
T -- thymine (DNA only)
U -- uracil (RNA only)

Strings of nucleotides

Often we see strings of the letters G, C, A, and T, representing strings of nucleotides on one strand of DNA (or RNA, with U instead of T): "AGTTCA". Don't split these up. The tokenizer will usually make one token of it; leave it that way and tag the whole string as a single noun.
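A token of this kind can be recognized by its alphabet. The sketch below is an assumption-laden illustration: the function name `is_nucleotide_string` and the `min_length` cutoff are invented here, and the cutoff exists only because short all-capital abbreviations can happen to use the same letters. It is a hint for the annotator, not a rule from these guidelines.

```python
import re

_DNA = re.compile(r"^[ACGT]+$")
_RNA = re.compile(r"^[ACGU]+$")


def is_nucleotide_string(token: str, min_length: int = 4) -> bool:
    """True if the token is long enough and uses only DNA or only RNA letters."""
    if len(token) < min_length:
        return False
    return bool(_DNA.match(token) or _RNA.match(token))


for token in ["AGTTCA", "AGUUCA", "AGTUCA", "TCDD", "CAT"]:
    print(token, is_nucleotide_string(token))
```

DNA and RNA are tested separately so that a string mixing T and U is not accepted.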

Complex Chemical Words (i.e., "fruit salad")

A chemical term with internal punctuation, such as "2,3,7,8-tetrachlorodibenzo-p-dioxin", should be tagged as a single token.

Fruit Salad and POS

"Fruit salad" is our nickname for chemical terms like this that include strings that aren't normally found in words, such as numbers, commas, hyphens, parentheses, and Greek letter names. Fruit salad is usually an NN, but other parts of speech can also be "fruit salad":

N-(3,5-dichlorophenyl)-2-hydroxysuccinamic_JJ acid_NN

The -ic ending makes the word an adjective.


the 2[N]-methylated compounds

Since both of the parts of this hyphenate are extremely technical words, we should consider it fruit salad and not break it up. How should we tag it? According to Santorini (pp. 15-17) and our own adaptation of her JJ or VBG/VBN rules, we would call this JJ, because there is no evidence of a verb "to 2[N]-methylate". But we do have indisputable VBNs of the same form, such as

which is N-demethylated by

was N-demethylated by

being O-demethylated to

Therefore, we have to conclude that like so many other parts of this technical vocabulary, "2[N]-methylate" is a verb constructed according to productive rules, and "2[N]-methylated" in this sentence is its VBN:

the 2[N]-methylated_VBN compounds

Fruit Salad and Tokenization

Do adjust the tokenization at the ends, if necessary. For example:

TCDD (2,3,7,8-tetrachlorodibenzo-p-dioxin)

Here, the parentheses clearly mark off a synonym for the abbreviation. If the '(' has been tokenized together with the '2', you have to separate them. The digit is part of the chemical name but the parenthesis is not. Tag as:

TCDD_NN (_-LRB- 2,3,7,8-tetrachlorodibenzo-p-dioxin_NN )_-RRB-
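The adjustment at the ends can be sketched as follows, using the Penn Treebank bracket tags -LRB- and -RRB-. The function name `peel_outer_parens` is hypothetical, only round parentheses are handled, and the tag for the remaining core defaults to NN purely for illustration; the annotator still chooses the real tag.

```python
def peel_outer_parens(token: str, core_tag: str = "NN"):
    """Separate an edge parenthesis from a token when it is not part of the name.

    An edge parenthesis is peeled off when it has no mate inside the token,
    or when its mate is the parenthesis at the opposite edge.
    """
    open_positions, mate = [], {}
    for i, ch in enumerate(token):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")" and open_positions:
            j = open_positions.pop()
            mate[i], mate[j] = j, i

    last = len(token) - 1
    peel_left = token.startswith("(") and mate.get(0, last) == last
    peel_right = token.endswith(")") and mate.get(last, 0) == 0

    core = token[(1 if peel_left else 0):(last if peel_right else len(token))]
    pieces = []
    if peel_left:
        pieces.append(("(", "-LRB-"))
    if core:
        pieces.append((core, core_tag))
    if peel_right:
        pieces.append((")", "-RRB-"))
    return pieces


print(peel_outer_parens("(2,3,7,8-tetrachlorodibenzo-p-dioxin)"))
print(peel_outer_parens("(2,3,7,8-tetrachlorodibenzo-p-dioxin"))
print(peel_outer_parens("N-(3,5-dichlorophenyl)-2-hydroxysuccinamic", "JJ"))
```

Parentheses that are matched inside the chemical name are left alone, so the internal punctuation of the fruit salad survives intact.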

Non-substance Fruit Salad

Fruit salad is not restricted to the names of substances. For example, a biochemical term referring to a process can also be fruit salad. See the NN or NNS Guidelines.

N-oxidation_NN

