Entity
From BioIE Wiki
Main Page : Entity
Introduction
These are the general guidelines for biomedical entity tagging, the process of tagging text strings that refer to entities. In addition, each domain has its own page of definitions and guidelines, listed under "Entity" on the Main Page.
Basic Principles
Regardless of the domain, we maintain a few basic principles in entity tagging, and WordFreak generally enforces them.
One Sentence Rule
An entity reference cannot cross a sentence, section, or paragraph boundary. It must be within one sentence, even if it is a chain. WordFreak enforces this rule for simple references, but not for chains, so we have to enforce it on ourselves in that instance.
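As a rough illustration of what enforcing this rule on a chain involves, the check can be sketched in Python. This is only a sketch under stated assumptions: spans are (start, end) character offsets with an exclusive end, sentence boundaries are already known, and the function names and the sample sentence are hypothetical rather than part of WordFreak.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def sentence_index(span: Span, sentences: List[Span]) -> int:
    """Return the index of the sentence that fully contains span, or -1."""
    start, end = span
    for i, (s_start, s_end) in enumerate(sentences):
        if s_start <= start and end <= s_end:
            return i
    return -1


def obeys_one_sentence_rule(links: List[Span], sentences: List[Span]) -> bool:
    """True if every link of a reference lies inside one and the same sentence.

    A simple reference is a list with one link; a chain has several.
    """
    indices = {sentence_index(link, sentences) for link in links}
    return len(indices) == 1 and -1 not in indices


# Made-up two-sentence text, used only to exercise the check.
text = "K- and N-ras mutations were found. They were rare."
first_end = text.index(".") + 1
sentences = [(0, first_end), (first_end + 1, len(text))]

k = (text.index("K-"), text.index("K-") + 2)
ras = (text.index("ras"), text.index("ras") + 3)
rare = (text.index("rare"), text.index("rare") + 4)

print(obeys_one_sentence_rule([k, ras], sentences))   # chain inside sentence 1
print(obeys_one_sentence_rule([k, rare], sentences))  # chain crossing a boundary
```

The first call accepts the chain "K- (+) ras"; the second rejects a chain whose links fall in different sentences, which is the case annotators have to catch by hand.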
Tag within Tag
One entity reference can be completely included in the range of another. However, as a general rule, we allow tags within tags only
- in the "information gathering mode" and
- in domain specific cases where they are specifically permitted or required.
The smaller tag may occur at the beginning of the larger one, in the middle, or at the end.
Beginning:
******
------------------
Middle:
      ******
------------------
End:
            ******
------------------
The general procedure, however, is to tag a reference that includes a reference to another entity as a whole phrase. This would be the case for a phrase such as "ras signal transduction mediators"; "ras" and the entire phrase are both gene entities, but we will not tag the embedded entity reference, "ras", in order to tag the phrase in its entirety.
Tag Within Tag and "Information Gathering Mode"
If a reference that we are tagging in "information gathering mode" includes references to entities that we are specific about, then we will tag the inner entities. For example, "Ewing's sarcoma" is tagged as Malignancy-type and within it "sarcoma" is tagged as Malignancy-histology.
There are several reasons for this decision:
- It can be difficult for annotators to analyze the structure of a complex phrase without expert domain knowledge.
- The tagged entities will eventually have pointers into databases containing such information as "ATPase is an enzyme" and "voriconazole is a triazole antifungal agent". Such databases will provide much of the information about embedded entity references.
- Many complex entity references are not names that can easily be found within a list, but are constructed phrases a domain specialist would understand, such as "cytochrome P450-dependent arachidonate metabolism". Such expressions can be harvested from the tagged references not found in databases and analyzed so automatic parsers can be developed.
- References tagged in "information gathering mode" will be subjected to human analysis and later revisited under stricter rules. Until we know what we are tagging (and not tagging) for such categories, we will tag references to entities of categories within them.
Double tagging
There are specific cases when we give the same piece of text two entity tags that exactly coincide.
**********
----------
For example, in "deletion of the K-ras gene", "K-ras" is tagged as a Gene-RNA. It is also tagged as Variation-location because the entire gene rather than a part of it is the object of the variation -- a special rule.
Overlapping Tags
Tags are not allowed to overlap in such a way that they have a part in common and each of them also has a part not shared with the other. The following examples illustrate this point:
NO!
**********
      ==========

NO!
      **********
==========
WordFreak does not allow overlapped tags. This is a feature, not a bug.
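The three span relationships described above (tag within tag, double tagging, and overlapping) can be told apart mechanically. The following Python sketch assumes tags are non-empty (start, end) character offsets with an exclusive end; the function names are hypothetical and this is not how WordFreak itself is implemented.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def relation(a: Span, b: Span) -> str:
    """Classify how two tagged spans relate to each other."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"   # no text in common
    if a == b:
        return "coincide"   # double tagging: exactly the same text
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    if a_in_b or b_in_a:
        return "nested"     # tag within tag
    return "overlap"        # shared part plus an unshared part on each side


def is_structurally_allowed(a: Span, b: Span) -> bool:
    """Only true overlaps are ruled out on structural grounds."""
    return relation(a, b) != "overlap"


print(relation((0, 6), (0, 18)))    # smaller tag at the beginning of a larger one
print(relation((0, 10), (0, 10)))   # two tags on the same text
print(relation((0, 10), (6, 16)))   # the "NO!" case
```

Note that "nested" and "coincide" are only structurally possible; whether a particular nested or double tag is actually permitted still depends on the mode and the domain-specific guidelines.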
Category Names and Definitions
The definition of each category corresponds only approximately to the normal biomedical definition of the name of the category (e.g., the "Gene" category within the oncology domain also includes proteins, and the "Malignancy" category includes benign tumors and precancerous states).
The problem is that for our purposes we need to have categories that are broader than the usual definitions and there are no convenient existing names for them. To compensate, we have to use a name that covers the majority of the kinds of entity we're including. It is worth noting that names of the categories are approximations, not to be taken in their strict normal biomedical meaning.
These are not the only categories of entity we intend to tag during the course of this project, but they are the ones that we initially used. Some of the categories for future consideration are Organism (e.g. human, rat, blood, etc.), Body Part (e.g. heart, kidney, etc.), Virus (e.g. HIV, EBV) and Cell Line (e.g. HeLa, NIH, 3T3, human bone marrow culture, etc.). Even over the whole course of the project we don't intend to tag everything. So don't tag these and don't worry about not tagging them.
Tagging and Text Length Considerations
In routine tagging, we expect to tag as much of the text as necessary to capture the information about the entity. Many entities are named entities because they have standard or fairly standard names that are part of the working vocabulary of a domain specialist and may be easily discerned in lists. Some examples are:
gastrointestinal stromal tumor (GIST)
voriconazole
Other examples, such as phrases with modifiers, are less clear. Examples of these terms are as follows:
metastatic colon cancer
right-sided colon cancer
Decisions regarding the treatment of these terms are made by domain specialists, and the types of modifiers to include or exclude are discussed on a case-by-case basis.
"Information Gathering Mode" vs. "Defined Mode"
For some categories, we are in "information gathering mode". We include in the tag a string of text which allows the domain experts to analyze different descriptions of that type of entity and decide how to treat it. The oncology annotation group did this with the "variation" category. After analysis we subdivided that category into five or six subcategories, which are for the most part clearly defined and no longer being tagged in information-gathering mode. "Defined mode" is the basic style of tagging, using a (more or less) clear category definition and limiting the tag to a (more or less) restricted section of text.
Abbreviations
Often we see a long name followed by an abbreviation, either set off by a dash or set within parentheses. If the name refers to a kind of entity that is being tagged in your domain, tag the name and abbreviation separately. They will be considered as two names for the same entity. Omit the parentheses. The following example is an illustration of such a term:
When evaluating text length in the context of tagging lengthy terminology, factor in the abbreviation. The fact that the authors used an abbreviation for a phrase (even if they invented the abbreviation) is strong motivation to tag the phrase in its entirety, even if it includes modifiers or other terms we would otherwise exclude. Here, since the abbreviation includes "M", the authors intended "monoclonal" to be part of the entity name.
Similarly:
Depending on domain-specific guidelines, a modifier such as "selective" might be excluded from the tagging string. Here, however, it is part of the phrase which is represented by the first "S" in "SSRIs" and as such, is sufficient reason to consider "selective" as part of the full name of the entity.
There are also instances when a name/abbreviation pair is part of a longer term which continues after the parenthesized abbreviation. In such cases, we do not try to separate out the abbreviation. Separating out the abbreviation would create a "tag within tag" situation. Instead, the longer name is tagged in its entirety. For example:
Here we would tag the entire expression and would not tag the abbreviation separately.
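The reasoning about "SSRIs" above (the first "S" stands for "selective", so the modifier belongs to the name) can be approximated with a naive initial-letter check. The Python sketch below is only an illustration: the function name, the matching heuristic, and the sample sentence are assumptions, and it is no substitute for an annotator's judgment.

```python
from typing import Optional


def long_name_for(text: str, abbreviation: str) -> Optional[str]:
    """Return the words just before "(abbreviation)" whose initials spell it.

    Naive heuristic: one word per capital letter of the abbreviation.
    Returns None when the initials do not line up.
    """
    marker = f"({abbreviation})"
    pos = text.find(marker)
    if pos < 0:
        return None
    # Capital letters only, so a plural "s" as in "SSRIs" is ignored.
    letters = [c.lower() for c in abbreviation if c.isupper()]
    words = text[:pos].split()
    if not letters or len(words) < len(letters):
        return None
    candidate = words[-len(letters):]
    if [w[0].lower() for w in candidate] == letters:
        return " ".join(candidate)
    return None


# Made-up sentence for illustration.
sentence = "patients treated with selective serotonin reuptake inhibitors (SSRIs)"
print(long_name_for(sentence, "SSRIs"))

# The heuristic is deliberately simple and misses names where one word
# contributes more than one letter:
print(long_name_for("gastrointestinal stromal tumor (GIST)", "GIST"))
```

The first call yields the full phrase starting at "selective", which is the span the guideline says to tag; the second returns None, a reminder that abbreviations do not always map one letter to one word.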
General Terms
If you see a general term such as "tumor", "cancer", "gene", etc., evaluate the following:
- See if it is part of a more specific phrase that still qualifies within the category. If it does, tag that phrase with the appropriate category.
- Sometimes a general term is added as a way of telling the reader "By the way, this entity belongs to this general class"; the general name appears near the specific one, but it's not part of the specific name. In such cases, don't include the general name in the tag unless it's necessary for distinguishing different entities that could otherwise be confused with each other.
For example, the same name can refer to a gene or its product. The specifics may vary by domain.
- The oncology entity annotators have separate tags to distinguish genes from proteins. The labels carry the distinction; the general term "gene" or "protein" is redundant.
  - "the K-ras gene": tag just "K-ras", with the Gene/RNA label
  - "the K-ras protein": tag "K-ras" with the Protein label
- On the other hand, CYP450 entity annotators classify proteins as Substances and don't tag genes at all. With the protein, they need to include the word "protein" to make it clear that this reference is to a Substance.
  - "the K-ras gene": no tag, because CYP doesn't tag genes
  - "the K-ras protein": tag the whole phrase, "the K-ras protein", with the Substance label
- If the general word is not part of a more specific phrase, does it refer in context to a specific entity or group of entities (not necessarily a small group), or does it refer to all the things of its kind?
If it refers to specific instances of the type of entity, tag it; if it's completely general, don't. ("Wait for Diagnosis" covers many of these cases, such as expressions like "these genes".) Introductory sentences and sections often contain general uses as the writer sets the stage. This may be a hard line to draw; if in doubt, tag it. If you find yourself in doubt often, ask.
- In a similar vein, annotators have asked whether a class can be a member of the same category as its members. In general, the answer is yes. With the classes we're using, a class of things would generally be in the same category as any of its members or subclasses. So, the "gene" category in oncology tagging includes both K-ras (a gene) and Ras (a family of genes), and the "malignancy" category includes both muscle tumors (a class of tumors) and smooth muscle tumors (a subset of the first class). In CYP450 inhibition, picric acid, hydrochloric acid, and acid or acids would all be tagged in the "substance" category.
That generalization doesn't apply when we specifically assign a subclass to a category of its own. In CYP450 inhibition annotation there is a general class of "substances" and a special class, "CYP450". Enzymes are substances, of course, and most enzymes are tagged as "substance", but the enzyme CYP450 and its variations are tagged as "CYP450", not as "substance".
Interpreting Ambiguous Text
Don't make an interpretation based on the text; tag only what's actually said. Here's a sentence from an oncology abstract:
The following question arose:
- Since the sentence says that Southern blot analysis indicates the presence of the variation, should Southern blot analysis be included in the string tagged for Variation? [That category was in information-gathering mode at the time.]
The biomedical specialists decided: No; "wait for the diagnosis". Southern blot analysis is an analytic technique, not a variation. The sentence is reporting the results of the analysis. When the text reports a variation, tag that as such, but not the steps the authors took to produce their finding. The same would apply to all other tests and methods.
Split Coordinations/References and Chaining
Often, a number of similar entities or events will be described in a single collapsed manner, such as
Researchers study events and entities individually even though they may describe them collectively, and a description like this refers to seven different entities (loss of heterozygosity at chromosomal locus 3p14, loss of heterozygosity at chromosomal locus 7q31-32, ...), not to a single event involving seven changes. This is no different in principle from such everyday usage as "Bill, Kate, and the Smiths all brought their children to the party" (three families), but it does cause problems for annotation.
What do we do with a phrase like "organic and inorganic acids"? This combined description plainly means "organic acids and inorganic acids", but it doesn't include a string "organic acids" for us to tag.* For a long time we used to tag each component of such a discontinuous reference with the same label that we would want to apply to the whole string, and use the comment field of the annotations to record the connection, like this:
| text | label | comment |
|---|---|---|
| organic | Substance | ... acids |
| inorganic acids | Substance | (none) |
| acids | Substance | organic... |
* NOTE: Don't even think of using "inorganic acids" without the first two letters. (1) It's a cheat. (2) It's not the actual string. (3) You can't generalize such a technique.
But the comment fields were not accessible to the machine learning algorithm or the data retrieval software and only served as a guide to other annotators in subsequent stages.
Around December 2003, we developed and introduced a chaining tool that lets us explicitly annotate discontinuous entities by building a chain of strings and applying a single tag to them. It is used only in entity annotation (except in Office Letter Annotation) and only for split references; in "information gathering mode" it is not used at all.
Each of the following examples shows
- the actual text
- what the wording would be if fully expanded
- the links of the chain, separated by "(+)"

| actual text | fully expanded | chain |
|---|---|---|
| K- and N-ras mutations | K-ras mutations and N-ras mutations | K- (+) ras |
| CYP1A1/2 (The slash presumably means "and", "or", or "and/or".) | CYP1A1 / CYP1A2 | CYP1A (+) 2 |
| androgen and estrogen receptors | androgen receptors and estrogen receptors | androgen (+) receptors |
| Stage I or II | Stage I or Stage II | Stage (+) II |
| male or female descendants or siblings | male descendants or female descendants or male siblings or female siblings | male (+) descendants; male (+) siblings; female (+) siblings |
Note that the shared part can be to either the right or the left of the unshared parts. The last example shows how all the content of the string may need to be included in chains: only "female descendants" does not need chaining, since it is a contiguous string in the text.
For more information, see Chaining with WordFreak.
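One way to picture a chain is as an ordered list of character spans that share a single tag. The Python sketch below rebuilds the last example above; it assumes links are given in text order and do not overlap, and the helper names are hypothetical, not part of the WordFreak chaining tool.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def find_span(text: str, piece: str) -> Span:
    """Locate the first occurrence of piece in text."""
    begin = text.index(piece)
    return (begin, begin + len(piece))


def simplify(text: str, links: List[Span]) -> List[Span]:
    """Merge links separated only by whitespace into one contiguous span."""
    merged = [links[0]]
    for start, end in links[1:]:
        prev_start, prev_end = merged[-1]
        if text[prev_end:start].strip() == "":
            merged[-1] = (prev_start, end)   # contiguous: no chain needed here
        else:
            merged.append((start, end))      # a real gap: keep a separate link
    return merged


def render(text: str, links: List[Span]) -> str:
    """Write a reference the way these guidelines do, with "(+)" between links."""
    return " (+) ".join(text[s:e] for s, e in simplify(text, links))


text = "male or female descendants or siblings"
male = find_span(text, "male")
female = find_span(text, "female")
descendants = find_span(text, "descendants")
siblings = find_span(text, "siblings")

for links in ([male, descendants], [female, descendants],
              [male, siblings], [female, siblings]):
    print(render(text, links))
```

Running this prints three chains and one plain string, "female descendants", matching the observation that only that reference is contiguous in the text.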
Chaining Syntax and Usage
In most chaining situations, one of the references in need of tagging will be an unbroken string such as "N-ras" or "CYP1A1". Chains are not necessary for these; they may be tagged as plain strings, like this:
NOTE: In these documents we use "(+)" to separate links of a chain.

| string | chain |
|---|---|
| N-ras | K- (+) ras |
| CYP1A1 | CYP1A (+) 2 |
It is also correct to tag "N-ras" or "CYP1A1" as a chain, but extra steps are involved.
Unrestricted chaining can create insuperable problems for treebanking. Chaining works well for split coordinations, but it should not be used for anything else unless there is an explicit exception in the guidelines. Currently, we have exactly one such exception. Other exceptions will be reviewed on a case-by-case basis, and annotators should not proceed with an additional exception until it has been approved. In short, chaining is reserved for entity annotation only, not for POS annotation or pretagging.
In "information gathering mode", we do not chain at all, but select and tag the whole text of the entity reference.
Embedded Abbreviations
Annotators may be tempted to use chaining to skip an embedded abbreviation. For example,
This expression should not be chained, given the existing practice and our desire to avoid creating tags that will cause problems for treebanking and propbanking. There is no chaining needed in the above example as there is no coordination. It should be treated as a single reference.
This applies even to an abbreviation embedded in a split coordination:
We do want to chain "1A1", "1A2", and "1B1" with the left hand part of the name, which is the correct way to chain this example. The chaining tool would also let annotators skip over "(CYP)". However, in order to stay consistent with existing practice, we treat the abbreviation as embedded in each case, like this:
chain: cytochromes P450 (CYP) (+) 1A2
Another complex example
Breaking the list up for analysis shows that we have six references:
3-acetate (III),
3-bromoacetate (IV),
3-propionate (V),
3-methyl ether (VI), and
3-deoxy-derivative (VII)
of 3beta-hydroxyandrost-4-ene-6,17-dione (I) were synthesized...
This involves 5 separate chains and one contiguous string:
3-acetate (III) (+) of 3 beta-hydroxyandrost-4-ene-6,17-dione (I)
3-bromoacetate (IV) (+) of 3 beta-hydroxyandrost-4-ene-6,17-dione (I)
3-propionate (V) (+) of 3 beta-hydroxyandrost-4-ene-6,17-dione (I)
3-methyl ether (VI) (+) of 3 beta-hydroxyandrost-4-ene-6,17-dione (I)
3-deoxy-derivative (VII) of 3 beta-hydroxyandrost-4-ene-6,17-dione (I)
Further examples of chaining with embedded abbreviations
"CYP 1A1 messenger RNA" is tagged as substance. There is also a chain "CYP1A1 (+) protein" which will be tagged as CYP.
(2) the 432L, 453S form of human CYP1B1 ...
(3) the expressed 432L, 453S form ...
Q: Should '432L' and '453S' be tagged as substances or CYP? These are forms of the enzyme in which a particular amino acid substitution has occurred at a particular location.
A: In #2, tag "432L (+) form of human CYP1B1" as a chained CYP reference and "453S form of human CYP1B1" as a solid CYP reference; in #3, "432L (+) form"; and so on.
Adjectival Forms
There are times when specific entity information appears as an adjective rather than its usual noun form, such as "point mutational activities" for the entity "point mutation" (a Variation-type in oncology tagging). In this case, "point mutational" is tagged as a Variation-type in order to capture specific descriptive information. Other cases are brought up for group discussion on the appropriate mailing list.
Singular and Plural Constructions
Generally, annotators do not distinguish between singular and plural constructions in marking entities. There are instances when plurals are incorrectly formed with an apostrophe, such as "Ki's" for the plural of "Ki". Tag the whole string (including the "s") just as the singular would be tagged.
