ONCOLOGY GUIDELINES
From BioIE Wiki
A note on notation
We will often use
- underlining for text that should be tagged as the entity type under discussion
- italics for text that should not be so tagged, or for comments within an example box
- boldface for other highlighting
- the symbol (+) to indicate chaining of non-adjacent pieces of text into a single tagged reference
Gene
We have three subtypes of Gene entity:
- Gene/RNA: for genes and RNA elements
- Gene-Protein: for the non-genomic downstream products of genes and RNA elements
- Gene-Generic: for cases that are ambiguous between the first two, or that refer to both the gene and its downstream product(s)
Gene/RNA
"Gene-Gene/RNA" is for genes and RNA elements (see the Definition).
The term "parvalbumin" would be tagged as Gene/RNA.
Glc-6-Pase mRNA
"XXX mRNA" and "XXX transcript" are both Gene/RNA; tag the entire phrase.
Gene/RNA Chaining
This is 3 genes, each with two aliases:
- ret-1/PTC-1
- ret-2/PTC-2
- ret-3/PTC-3
We can tag this with the chaining tool (the plus sign in parentheses indicates chaining):
- ret (+) -1
- ret (+) -2
- ret (+) -3
- PTC (+) -1
- PTC (+) -2
- PTC (+) -3
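The six chains above are every pairing of a stem with a numeric suffix. A minimal sketch of that enumeration follows; the function name `chained_mentions` and the tuple representation of a chain are assumptions made for illustration and are not part of the annotation tool.

```python
def chained_mentions(stems, suffixes):
    """Pair every stem with every suffix: one chained reference per pair."""
    return [(stem, suffix) for stem in stems for suffix in suffixes]


chains = chained_mentions(["ret", "PTC"], ["-1", "-2", "-3"])
for stem, suffix in chains:
    # Render each chain in the guideline's notation, e.g. "ret (+) -1".
    print(f"{stem} (+) {suffix}")
```

Each tuple stands for one tagged reference whose two pieces are not adjacent in the text.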
Leave "kinase" out of the tag and just tag "STK15" and "BTAK", separately.
Tag the whole phrase.
Gene Names that Include Other Gene Names
Q: In "PRDC (protein related to DAN and cerberus)" [PMID 09639362] I tagged "PRDC" as a gene and "protein related to DAN and cerberus" as a gene. Do I tag DAN as a gene? Is cerberus a gene?
A: No; no tag within tag. Your tags are correct, and "protein related to DAN and cerberus" is the name of a gene. Tagging "DAN" as well would create tag within tag. "Cerberus" is also a gene, but we won't tag it, same reason.
Q: Several lines down they refer to "the products of DAN (differential screening-selected gene aberrative in neuroblastoma)". I tagged "DAN" as a gene. I tagged "differential screening-selected gene aberrative in neuroblastoma" as a gene. Do I tag "neuroblastoma"?
A: No: no tag within tag, even across categories (i.e., not even a Malig-type within a gene/RNA).
Q: Do we still tag "tumor suppressor genes" as a whole, or only "tumor suppressor"?
A: Tag just "tumor suppressor". The plural "genes" is significant, but the POS tags and Treebank will allow us to recover that.
See the general rule on tags within tags.
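The "no tag within tag" rule can be checked mechanically once tags are stored as character offsets. The sketch below assumes a hypothetical representation of each tag as a `(start, end, label)` tuple with an exclusive end offset; the function name `find_nested_tags` is likewise an assumption. Two tags covering exactly the same span are not reported, since double-tagging one string is a separate matter from nesting.

```python
def find_nested_tags(spans):
    """Return (outer, inner) pairs where one tagged span lies strictly inside another."""
    nested = []
    for outer in spans:
        for inner in spans:
            same_extent = (outer[0], outer[1]) == (inner[0], inner[1])
            if not same_extent and outer[0] <= inner[0] and inner[1] <= outer[1]:
                nested.append((outer, inner))
    return nested


text = "protein related to DAN and cerberus"
dan_start = text.index("DAN")
spans = [
    (0, len(text), "Gene"),                       # the full gene name
    (dan_start, dan_start + len("DAN"), "Gene"),  # a tag the rule forbids
]
violations = find_nested_tags(spans)
```

Here `violations` holds a single pair, flagging the "DAN" tag as lying inside the longer gene name.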
Gene/RNA Exceptions
Q: I've seen "enhancer" and "promoter" tagged as gene/rna but don't think they should be?
A: "Enhancer" and "promoter" should not normally be tagged as gene/rna, because they usually refer to components of genes that are typically not even transcribed. Occasionally, however, the words describe a property of a gene or protein ("We found this protein to stimulate reaction X. This reaction enhancer/promoter..."), in which case they would be tagged.
Gene-Protein
"Gene-Protein" is for the non-genomic downstream products of genes and RNA elements.
Labeling Proteins
heparin sulfate proteoglycan (HS-PG)
Tag anything containing a protein as a protein. (But see Gene-Entity Boundary Issues.)
Protein Complex Labeling
Q: Is (cadherin-catenin) tagged as gene-protein in one tag, or separated without the hyphen (cadherin)-(catenin)?
A: Tag them separately: "cadherin-catenin".
Tag each as Gene-protein, chaining for B and C:
- HLA-A
- HLA- (+) B
- HLA- (+) C
- HSAN 1.2
Q: Do I tag "135 kDa glycosylphosphatidylinositol-linked glycoprotein" in the first sentence? And besides "contactin/F11/F3" do we tag anything else in the last sentence?
The Neuro-1 antigen, structurally characterized as a 135 kDa glycosylphosphatidylinositol-linked glycoprotein...
A: Yes.
Some General Examples of Gene-Proteins to Tag
- antigens
- families of proteins, such as G-proteins
Not Tagged as Gene/Protein
- nucleic acids
- purines
- pyrimidines
These should be tagged as variation-state-original, -altered, or -generic.
"X Gene", "X Protein", "Kinase X", etc., vs. "Pseudogene"
When you see a phrase like "the p53 gene" or "an N-ras protein", don't include the word "gene" or "protein" in the tag. It tells you whether to use the Gene/RNA button or the Protein button, and once you've done that the word itself is superfluous. The same goes for "oncogene".
But a pseudogene is not a gene, so if you're tagging "XXX pseudogene" include "pseudogene" as well:
an N-ras protein (Gene/protein)
the N-myc oncogene (Gene/RNA)
the c-fos protooncogene (Gene/RNA)
"X Y-ase" is often an enzyme (a Y-ase) that acts on X, so it is safest to include the last word in the tag if it is an enzyme name. But if the "Y-ase" precedes the "X", as in "kinase STK15", you can pretty well tell that it is redundant and explanatory-- "STK15 (which, by the way, is a kinase)"-- so do not include it in the tagged string:
kinase STK15 (Gene/protein)
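The trimming rules in this section can be sketched as a small helper. This is an illustration under stated assumptions, not the project's tooling: the function name `trim_gene_mention` and the three word lists are hypothetical, dropping a leading determiner is an assumption, and only the leading word "kinase" from the example above is handled. "pseudogene" is deliberately left out of the list of redundant final words, and "protooncogene" is not handled.

```python
DETERMINERS = {"the", "a", "an"}
REDUNDANT_FINAL = {"gene", "protein", "oncogene"}  # "pseudogene" is deliberately absent
REDUNDANT_INITIAL = {"kinase"}                     # only the case shown in the example


def trim_gene_mention(phrase):
    """Drop a leading determiner and redundant class words from a gene mention."""
    words = phrase.split()
    if words and words[0].lower() in DETERMINERS:
        words = words[1:]
    if len(words) > 1 and words[-1].lower() in REDUNDANT_FINAL:
        words = words[:-1]
    if len(words) > 1 and words[0].lower() in REDUNDANT_INITIAL:
        words = words[1:]
    return " ".join(words)


for phrase in ["the p53 gene", "an N-ras protein", "kinase STK15", "XXX pseudogene"]:
    print(phrase, "->", trim_gene_mention(phrase))
```

A trailing enzyme name, as in "X Y-ase", is left untouched, matching the advice to keep it in the tag.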
Gene-Generic
Usually there is no problem deciding which tag to use. But the same name or symbol can be used for a gene and for the protein it expresses. "Generic" is just for those times when either it's not clear whether the reference is genomic (Gene/RNA) or proteomic (Protein), or the author was evidently referring to both types together. Most of the time you will use either Gene/RNA or Protein. The need for Generic is fairly uncommon. Some examples may include words like "mRNA" or "protein" that we do not include in the tag:
Q: Since they mean "MK mRNA or MK protein", and we do not tag "mRNA" or "protein", how do I tag "MK" as an mRNA and as a protein without tagging it twice?
A: Tag 'MK' as Gene-generic.
Q: Here would 'ras' be tagged as gene or are they referring to the protein product, or maybe referring to both the gene and protein and thus tag it as gene-generic?
A: Since you can't tell whether they mean "immunostaining with the ras oncogene protein" -- in which case you would tag it as Gene/protein -- or "immunostaining with protein, to test for the ras oncogene" (which would be Gene/RNA), tag 'ras' as Gene/generic.
Gene-Entity Boundary Issues
1. Generally, we tag a protein (or RNA) as a Gene-entity value only when it can be traced back to a single gene in the genome.
REASON: This entity is based on gene identification, and the research is not intended to annotate all the protein mentions but the downstream product of a single gene.
2. Some proteins, composed of different subunits expressed by different genomic components, are not considered to be Gene-entities. Instead, their subunits are entities.
REASON: Only each subunit is the downstream product of one gene. The complex consisting of the combination of subunits is not our target, although it might be a functional protein. For example, hemoglobin is composed of alpha and beta subunits, so the subunits rather than the whole protein are Gene-entities.
3. A protein complex or aggregate can be composed of different proteins, in which case they are definitely not included in Gene-entity. For example, a microtubule is a protein aggregate composed of 3 different types of tubulins (protein subunits).
4. A protein complex or aggregate composed of a single protein type (from a single gene) is included in Gene-entity if it is not referring to the sub-cellular structure but the protein itself. For example, "Gap junction" is a sub-cellular structure, and is composed of 6 identical subunits. It should be included in the Gene-entity only when referring to the protein component of the structure in the texts.
Structure reference example (do not tag):
Protein reference example (tag as Gene/protein):
5. Antibodies are composed of different genomic fragments, so they are not Gene-entities at all. All antibody mentions are excluded, even very specific ones like "anti-x antibody".
6. Gene or protein family mentions are included in this entity unless they are too general, such as "enzyme", "gene", or "protein". But "kinase", "tyrosine kinase", etc., are considered as different levels of gene/protein families and are tagged.
7. All decisions on gene or protein names are based on the textual context. A gene/protein description can be considered an entity value when the authors intend it as one.
Three ways to determine whether a string is a pure description or a value of this entity:
- See if it is on the gene list
- See if it is used in the text as a gene name
- Send it to the email list
Variation
This category includes six tags:
- 1. Variation Type
- 2. Variation Location
- Variation States:
- 3. Original State (wildtype) of the amino acid(s) and/or nucleic acid(s)
- 4. Altered State of the amino acid(s) and/or nucleic acid(s)
- 5. Generic State tag, like the generic gene-entity tag, for when you can't tell whether a state is being described as initial or altered.
- 6. Variation Event; the variation as a whole.
Variations are extremely complex entities, actually involving a relationship between these components. Although there is a proposed standard notation to describe them, it is hardly ever used, and the literature contains a great many different ways of describing them.
Here are some examples of the categories that we are now using to describe Variation. These lists are not exhaustive; they keep growing as we look at files and you ask questions.
Variation Type
Specifies the kind of change in the genomic material in a particular instance of variation, or a particular group of instances. The following list is not exhaustive.
- n-... [terms beginning with a numerical prefix: mono-, di-, tri-, tetra-,...]
- n-ploidy, -ploid; also
- hyper-n-ploid(y)
- hypo-n-ploid(y)
- n-somy
- Abnormal expansion
- Allelic duplications
- Allelic imbalance (synonym for "homozygous deletion")
- Allelic loss (synonym for "homozygous deletion")
- Amplification (when it is describing a cellular process of creating an abnormal number of copies of a particular genomic sequence, as in "amplification of the MYCN gene")
Caution: "amplification" is also used as the name of an analytic method: "Peptide nucleic acid (PNA)-mediated PCR clamping was used for mutant-specific amplification." In such uses it is not a variation type.
- Amplified
- Aneuploidy
- Basepair change (synonym for "point mutation")
- Copy number increases
- Deletion: This can include the deletion or partial deletion of a chromosome or arm of a chromosome, which may be notated as, for example, "del(20q)". Here "del" is the variation type and "20q" is the location. [Don't tag the parenthesis characters, "(" and ")", at all.]
- Double minute chromosome
- DNA ploidy
- Duplication
- Expanded intrachromosomal region
- Fission
- Frame shift mutations
- Hemizygous deletion
- Heterozygote mis-sense mutation
- Homozygous deletion
- Hyperdiploidy (and the adjective form hyperdiploid; also see n-...)
- Hypotetraploidy (and the adjective form hypotetraploid; also see n-...)
- In-frame deletions without frameshift
- Insertion
- Interstitial deletions
- Inversion
- Large fragment deletion
- Loss (as a synonym for "homozygous deletion")
- Loss of heterozygosity (or "LOH")
- Microdeletion
- Missense mutation
- Monosomy
- Nonsense mutation
- Point mutation
- Polymorphism
- Rearrangement
- Restriction fragment length polymorphism (RFLP)
- Silent mutation
- Single base mutations
- Single-base missense mutation
- Single nucleotide substitution
- Single strand conformation polymorphism (SSCP)
- Substitution
- Tetraploidy (see n-...)
- Transition
- Translocation
- Transposition
- Transversion
- Trisomy (see n-...)
- Truncated mutation
- Truncation
- Wild type, wild-type
While this actually denotes the absence of a variation, it is often used in contrast to variations, and our researchers have decided that it is best treated as a Variation-type.
Alternate Names
Besides synonyms, there are many ways of referring to mutations. People may refer to any of these with or without the word "mutation". Someone may say something like "the transition". And so on.
Sometimes the name is used in an adjectival form, as in "point mutational activities". In this case we would tag "point mutational" even though "mutational" is grammatically an adjective.
Not Variation-Type
- alterations, genetic alterations
- These expressions are entirely too general to tag.
- methylation
- something that happens to the gene, not a change in the gene itself
- microsatellite instability
- not a type of variation, but a characteristic of the DNA that makes it prone to variation
- hypermutability, tendency for mutation
- descriptions of DNA that is prone to variation
- overexpression
- Defined as "excessive expression of a gene by producing too much of its effect or product"
Variation-Type Chaining to Avoid Tag-Within-Tag
Q: How should I tag "non-myc-amplified"? Here the gene and the variation type seem to be hyphenated together; and, even worse, the combination is then negated with "non-"!
A: If it were just "myc-amplified", it would go like this:
- myc ==> G/RNA
- amplified ==> Var-type [don't tag hyphen between "myc" and "amplified"]
But for "non-myc-amplified", tag the whole string "non-myc-amplified" as Variation-type, with a Comment that you really mean "non-amplified". That means not tagging "myc", to avoid tag-within-tag.
Variation Location
The place within the genomic material where the change occurs. We tag all genomic locations, and only genomic locations. Protein locations are excluded, except for amino acid locations, which are a conventional way to specify variation location and can be easily traced back to genomic locations.
- Examples of protein locations (do not tag as Variation Location): N-terminus, phosphorylation site, zinc finger, loop region
- Example of amino acid location (tag as Variation Location): There is a point mutation converting Ala to Asn at position 159 of the amino acid sequence.
Q: Should "5' flanking region" be tagged as var-location?
A plasmid carrying the 5' flanking region of the mouse proliferating-cell-nuclear-antigen (PCNA) gene or DNA polymerase beta gene [PMID 7909518]
A: Yes; it is a region of the genome. Amino acid positions in the protein sequence, such as the number in "Gly64Val", are not strictly locations in the genome, but they map directly to positions in the genome and count for us as Var-loc.
Most often the location is within a gene, but not always:
codon position
codon 61.2
ser12
"ser" is the symbol for serine, the amino acid coded by this codon, tagged as Variation-state (either -original or -altered, or possibly -generic).
S12P
These notations describe a change of the serine (Variation-state-original) at codon 12 to proline (Variation-state-altered).
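Compact notations of this kind pack three tags into one token. The following sketch splits such a token into its parts; the regular expression, the function name `parse_substitution`, and the dictionary keys are assumptions chosen for illustration, and the pattern covers only the simple "letters, digits, letters" shape (for example "S12P" or "Gly64Val").

```python
import re

# Letters (original state), digits (location), letters (altered state).
SUBSTITUTION = re.compile(r"([A-Za-z]+?)(\d+)([A-Za-z]+)")


def parse_substitution(notation):
    """Split a token such as 'S12P' into original state, location, and altered state."""
    match = SUBSTITUTION.fullmatch(notation)
    if match is None:
        return None  # e.g. "ser12" has no altered state, so it is not a substitution
    original, position, altered = match.groups()
    return {
        "state-original": original,
        "location": position,
        "state-altered": altered,
    }


print(parse_substitution("S12P"))
print(parse_substitution("Gly64Val"))
```

A token such as "ser12", which names only a state and a position, is returned as `None` and needs separate handling.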
nucleotide or protein sequence position
G48
"G" stands for glycine (the amino acid at this position), which is tagged as Variation-state like "serine" above.
The guanine (Variation-state-original) at base pair 48 changes to adenine (Variation-state-altered).
cytogenetic band
t(11; 14)(q13; 32) (tagged as 11 (+) q13 and 14 (+) 32; see Translocations)
gene
Sometimes the variation location can be a gene, when the entire gene rather than a part of it is the object of the variation. In such cases we double-tag the gene as gene/RNA and as location:
deletion of the K-ras gene
type----        G/RNA
                loc--
translocation of the H-ras gene to location such-and-such
type---------        G/RNA         loc-------------------
                     loc--
This can happen with at least the following variation types:
- amplification
- deletion
- duplication
- translocation
NOTE: Double-tag genes as both location and Gene/RNA only where it is clear that the variation affects the entire gene. Do not double-tag in expressions like
- "a deletion mutation in the K-ras gene at codon 5"
Here the variation is specified as affecting a specific section of the gene (codon 5), not the whole gene.
- "K-ras mutations"
This means simply that the location is somewhere in the gene.
chromosome arm or chromosome
The variation location can also be a larger unit, such as a chromosome or chromosome arm:
del(13q)
Tag "del" as type and "13q" as location.
The location may also be included in a single string of notation together with the original and altered states, as in the example above or in "Gly64Val".
relative location
Relative locations by themselves don't mean much in terms of pointing out the exact positions on the genome, but they can add precision to some absolute locations.
GC box
promoter
codon 32
range as location
A location can be specified as a range:
in codons 18 through 20 of the Ki-ras gene
   Loc-----------------        G/RNA-
And sometimes this range is stated in terms of genes: "from gene A to gene B", "between genes A and B", etc. In such cases, we will tag the range as a location AND we will tag the genes within it (as "gene/RNA"). (This forms an exception to our usual rule against tags within tags.)
from gene A to gene B
G/R G/R
Loc--------------------
between genes A and B
G/R G/R
Loc------------------
More examples
In each of the following examples, the underlined strings should be tagged as variation-location.
On chromosome analysis of a metastasis, a stemline with karyotype 47,X Y,+der1 (1 qter---1 cen::1q21---1 qter) was identified. [PMID 4044631]
Decreased luciferase activity was observed with promoter constructs that lacked one or two E-box sequences or had E-box double point mutations, while a truncated MPR1 promotor lacking all three E-boxes exhibited only basal levels of activity. [PMID 14737110]
NACP-Rep1, a polymorphic microsatellite upstream of the alpha-synuclein gene
and in the title
Functional analysis of intra-allelic variation at NACP-Rep1 in ....gene. [PMID 12923682]
Since the mutations in the Ha-ras and Ki-ras oncogenes were located opposite potential pyrimidine dimer sites... [PMID 2064725]
Variation States
Variation-State-Original, Variation-State-Altered, Variation-State-Generic
A variation is a change from one state of the genome to another. We have separate tags for the original and the altered states, as well as a "state-generic" tag for use when it isn't clear from the notation and the immediate text whether a state is original or altered.
The states may be expressed in prose, as in "change of glycine to alanine", or as a formula that shows the two states linked by an arrow or similar marker. Such a formula may also include the location, as several of these examples do. The original state is shown here in red italics, the altered state in green italics, and the location in blue (not italic).
Described as amino acid change:
- Gly->Ala
- Gly48->Ala
- G->A
- Glycine to Alanine
- Try-557 --> stop
Described as nucleic acid change:
- A->T
- CAG->CTG
- A48->G
- c.48A>G
Q: In this sentence would the second C be variation-state-generic?
A: No, it would be variation-state-original. The authors are saying that the C that changes to a T in "two of the C to T transitions found in nonsmokers" is the C of a CpG site. That's variation-state-original.
Variation Event
This category refers to the variation as a whole. It is similar in concept to the un-subdivided "Variation" category we began with, but its scope is limited to names or terms that refer to a whole variation, not long strings of text that describe it. (See Variation-Event Introduction for a fuller explanation.)
We use the Variation-event tag in two circumstances.
Frequently a variation or group of variations is described in specific detail in one or two sentences and is subsequently referred to with a phrase like "the mutation" or "this deletion" or "these point mutations"; or the reference may precede the description, either in the title or in the text. As long as the reference is to a variation event that is specified in terms of location, type, and/or state, tag it as a Variation-event, excluding determiners (the, this, these, that, those, ...). If it refers to a group of variation events tag it only if they are described as a group sharing at least one kind of specification.
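Excluding determiners from a Variation-event tag is a simple trimming step. The sketch below assumes the determiner list is exactly the five words named above (the guideline's "..." suggests there are more) and uses a hypothetical function name, `event_span`.

```python
# Only the determiners spelled out in the guideline; the list is not exhaustive.
DETERMINERS = ("the", "this", "these", "that", "those")


def event_span(phrase):
    """Return the part of a referring phrase that would be tagged as Variation-event."""
    words = phrase.split()
    while words and words[0].lower() in DETERMINERS:
        words.pop(0)
    return " ".join(words)


for phrase in ["the mutation", "this deletion", "these point mutations"]:
    print(phrase, "->", event_span(phrase))
```

Whether the phrase qualifies as a Variation-event at all still depends on the surrounding context; the helper only trims the span.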
Some genomic variations are common enough or important enough in research to have names of their own. Down's syndrome (trisomy 21) is so widespread that the name is familiar to many laypeople. Others that we have encountered in this project are:
- bcr/abl
- Philadelphia chromosome
Important note: Use this tag only when there is at least some specific information about the variation: at least a location, type, or (any kind of) state. Do not include the specific information in the tagged text. It doesn't even have to be in the immediate vicinity, as long as it clearly applies to the text you're tagging as a variation event. Some examples:
"Philadelphia-chromosome" and "bcr/abl" are names for specific variations. Their definitions include the type, location, and original and altered states. (You would have to know this, whether from a question or a reference tool; it's not evident in the context.) Tag the underlined expressions.
"Genetic anomalies" is highly restricted by the context, which includes mentions of specific locations and types. We use the context to evaluate the reference, but we don't include it in the tag.
Variation-Event Example
"mutations" can be treated as var-event, but not "activating mutations".
Variation-Event Exception
If (and only if) "chromosomal aberration" refers to a specific variation, it will be tagged as var-event. In this text it does not, so it should not be tagged.
Variation Examples
Acronyms
An acronym (or initialism, in these cases) is a single token and has to be tagged as a unit, even if its letters can be associated with words that would be tagged separately.
BCM should be tagged as variation-event:
- beta-catenin -->Gene.gene
- mutations --> Variation.event
- BCM -->Variation.event
They are using "KRM" to refer to a type of Variation.event, so it will be tagged as such.
- K-ras (Gene.gene)
- mutations -->Variation.event
- at codon 12 -->Variation.location
- KRM -->Variation.event
- BCR -->Variation.location
- rearrangement -->Variation.type
- BCR rearrangement -->Variation.event
Variation-Type & Variation-Location
- deletion -> variation.type
- fifth (+) chromosomes -> variation.location
- seventh chromosomes -> variation.location
Variation-State-Original & Variation-State-Altered
Amino acids can be states. Here "glycine" is the altered state, and "aspartate" and "GAT" are both original state.
In the formula
G:C to T:A
(or "T:A", "G:C->T:A", and other forms) we tag as follows:
G:C - var-state-original
T:A - var-state-altered
Variation-Type, Variation-Location, Variation-States, & Variation Event
As above, we use red italics for variation-state-original, green italics for variation-state-altered, and blue straight type for location. Other variation entity types -- type, event, and state-generic -- are underlined.
Nucleotide sequence analysis of one Hep G2 N-ras allele demonstrated that codons 12, 13, and 59 were normal and that codon 61 had a missense mutation (CAA to CTA). This mutation results in the incorporation of leucine instead of glutamine at residue 61 of the N-ras gene product, p21. [PMID 02154325]
- 'codons 12' -> var-location
- 'codons (+) 13' -> var-location [chained]
- 'codons (+) 59' -> var-location [chained]
- 'codon 61' -> var-location
- 'missense mutation' -> var-type
- 'CAA' -> var-state-original
- 'CTA' -> var-state-altered
- 'glutamine' -> var-state-original
- 'leucine' -> var-state-altered
- 'This mutation' -> var-event
The results of dot blot hybridization assays and DNA sequencing showed a G-to-C transition of the first nucleotide at codon 13 c-Ha-ras. This is the first time that such a mutation has been detected in human cancer tissues. [PMID 2108944]
- 'G' -> var-state-orig
- 'C' -> var-state-alt
- 'transition' -> var-type
- 'first nucleotide' -> var-loc
- 'codon 13' -> var-loc
- 'such a mutation' -> var-event
IRP binding is abrogated when APP cRNA probe is mutated in the core IRE domain (Delta4 bases:Delta83AGAG86.) [PMID 12198135]
Our domain expert commented: This is absurd phrasing, but it still requires a solution. I would suggest tagging "core IRE domain" as var-loc, "delta4 bases" as var-type, the second "delta" as var-type, "83AGAG86" as var-loc, and "AGAG" as var-state.
- 'core IRE domain' -> var-loc
- 'Delta4 bases' -> var-type
- the second 'Delta' -> var-type
- '83AGAG86' -> var-loc
- 'AGAG' -> var-state-generic [tag within tag]
Variation-State-Generic
a point mutation of N-ras at codon 12 (N12-cys) and codon 61 (N61-his)
...
19/26 colonies contained the N12-cys mutation. The N61-his mutation
was not detected in any of the colonies obtained. [PMID 7803279]
'cys' and 'his' should be tagged as altered states, but since it wasn't that clear, tagging it as state-generic would be fine too.
Tagging Specific Types of Variation
Some types of variation are more complex than others, or raise questions about how to tag them. Here are some specifics.
Translocations
These are a complex type of variation, in which pieces of chromosomes get swapped around. Most of them involve a single exchange between two chromosomes:
wild type: chromosome A: aaaaaaaaaaaaaaaaaaaaaaAAAAA
           chromosome B: bbbbbbbbbbbbbbbbbbBBBBBBBBB
variation: chromosome A: aaaaaaaaaaaaaaaaaaaaaaBBBBBBBBB
           chromosome B: bbbbbbbbbbbbbbbbbbAAAAA
There's a fairly standard notation for these:
t(1;15)(p36.3;q24.2)
meaning:
translocation with chromosome 1 being split at arm p, band 36, subband 3 and chromosome 15 being split at arm q, band 24, subband 2
and then the halves swap places.
Now, the original and altered state are implicit in this information, but they are not explicit there. There are two locations (here, 1p36.3 and 15q24.2), but they're not "before" and "after". But in annotating translocations we will tag one of the locations as "state-original" and the other as "state-altered". It doesn't theoretically matter which is tagged as which, but for consistency's sake let's tag the one mentioned first as original.
So we would tag this piece of notation
t(1;15)(p36.3;q24.2)
- t -- var-type
- 1 (+) p36.3 -- var-state-orig
- 15 (+) q24.2 -- var-state-alt
with each of the two states being a two-part chain.
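The tagging convention for translocation notation can be sketched as a small parser. The regular expression and the function name `tag_translocation` are assumptions for illustration; the pattern covers only the two-chromosome form shown above and follows the convention of tagging the first-mentioned location as original.

```python
import re

# t(<chromosome A>;<chromosome B>)(<band A>;<band B>), with optional spaces.
TRANSLOCATION = re.compile(
    r"t\(\s*(\w+)\s*;\s*(\w+)\s*\)\(\s*([\w.]+)\s*;\s*([\w.]+)\s*\)"
)


def tag_translocation(notation):
    """Return (text, tag) pairs for a two-chromosome translocation, or None."""
    match = TRANSLOCATION.fullmatch(notation.strip())
    if match is None:
        return None
    chrom_a, chrom_b, band_a, band_b = match.groups()
    return [
        ("t", "var-type"),
        (f"{chrom_a} (+) {band_a}", "var-state-orig"),  # first mentioned: original
        (f"{chrom_b} (+) {band_b}", "var-state-alt"),   # second mentioned: altered
    ]


for text, tag in tag_translocation("t(1;15)(p36.3;q24.2)"):
    print(f"{text} -- {tag}")
```

Each state comes back as a two-part chain, because the chromosome number and its band are not adjacent in the notation.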
Deletions
Deletions can be described with more or less detail and in different ways. Sometimes the states will be specified, other times they will not be. Here are some examples:
deletion of bp 23-25
We will tag the base pair range as location, not state. (Assume that most of these examples begin with "deletion of", and tag "deletion" as type.)
... exon 6
Similarly, an exon, or an intron, or a codon, or a range of them, will be a location.
deletion resulting in GGCTT -> GT
Here we have explicit original and altered states, but no location.
... 3 base pairs in exon 6
We have a location, but the text doesn't say which base pairs are deleted, so we don't have any states or more precise location.
... D1S434-D1S228
This range specifies a location, in terms of markers that are used to identify specific regions. This is similar to saying "between genes X and Y" where the range between X and Y is the location.
... GCT at bp23-25
The nucleotide sequence GCT is the original state, located at base pairs 23-25.
The key decision here is to distinguish whether a specification of nucleotides, base pairs, or amino acids constitutes a state or a location. If the text identifies them by symbol or name, they're a state; if only by address, so to speak, they're a location.
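The symbol-versus-address decision can be approximated with a rough heuristic. The sketch below is an assumption-laden illustration, not a project rule: the function name `state_or_location` is hypothetical, "symbol or name" is narrowed to nucleotide strings and three-letter amino acid codes, and "address" is narrowed to any string containing a digit.

```python
import re

# Standard three-letter amino acid codes.
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def state_or_location(text):
    """Guess whether a specification names a state or gives a location."""
    token = text.strip()
    if re.fullmatch(r"[ACGTU]+", token):
        return "state"      # identified by nucleotide symbols
    if token.capitalize() in AMINO_ACIDS:
        return "state"      # identified by amino acid code
    if any(ch.isdigit() for ch in token):
        return "location"   # identified only by an address
    return None             # undecided: needs a human decision


for example in ["GCT", "bp 23-25", "exon 6", "D1S434-D1S228"]:
    print(example, "->", state_or_location(example))
```

Anything the heuristic cannot place is returned as `None` rather than guessed.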
Malignancy
Malignancy Type
This refers to the names clinicians give the different types of cancer. As you can imagine, just like genes, there are different ways to name a single malignancy type: morphologic features, histological observation, anatomical location, the name(s) of the discoverer or patients, and many more. These criteria are not mutually exclusive. "Leukemia" could be considered as either an anatomical or a histological type-name, but either way it's a Malignancy-type. "Squamous cell carcinoma" and "Ewing's sarcoma" are made up of a cancer name and a modifier; the unmodified name by itself ("carcinoma", "sarcoma") would be a Malignancy-type, but we don't tag it within the more detailed name; we tag the full phrase.
We don't have a list of names of Malignancy-type, and we probably never will have a complete one. This is something like information-gathering, but with some restrictions. With your bio or medical backgrounds, you will probably be able to recognize what is meant as the name of the cancer -- the Malignancy-type -- most of the time. Tag it. But we are restricting it: no prepositions. If you see "cancer of the lung", the Malignancy-type ends at "of": it's just "cancer". Tag "lung" separately as Malignancy-site.
cancer of the lung
------              Malignancy-type
              ----  Malignancy-site
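The "no prepositions" restriction amounts to splitting the phrase at the preposition. A minimal sketch follows, assuming only the preposition "of" (optionally followed by "the") needs handling; the function name `split_type_and_site` is hypothetical.

```python
import re


def split_type_and_site(phrase):
    """Split 'X of (the) Y' into (Malignancy-type, Malignancy-site)."""
    match = re.fullmatch(r"(.+?)\s+of\s+(?:the\s+)?(.+)", phrase.strip())
    if match is None:
        return phrase.strip(), None  # no preposition: nothing to split off
    return match.group(1), match.group(2)


print(split_type_and_site("cancer of the lung"))
print(split_type_and_site("lung cancer"))
```

A phrase without a preposition, such as "lung cancer", is returned whole with no separate site.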
In the list below, the text you would tag as Malignancy-type is italicized. When you see a name that you think should qualify as a type but doesn't fit any of the criteria in this list -- morphology, histology, anatomy, or eponymy -- tag it and mention it on the onco-list mailing list.
Morphology: the cell types affected by the cancer. Some examples:
- squamous cell carcinoma -- (squamous cell)
- neuroblastoma -- (neuroblast)
- glioma -- (glia)
Histology: the type of tissue affected by the cancer:
- carcinoma -- (epithelial tissue)
- leukemia -- (the blood-forming organs)
Anatomy: at which body parts the cancer is active:
- lung cancer -- (lung)
- brain tumor -- (brain)
- retinoblastoma -- (retina)
Eponymy: the name of the person who first described the cancer, or in whom it was first described:
- Brenner tumor
- Ewing's sarcoma
- Hodgkin's lymphoma
- non-Hodgkin's lymphoma
Malignancy Adjectives
Tag adjectival forms of Malignancy-type as well, such as
adenomatous carcinomatous cervical cancerous
Neoplasms
An annotator asked about the treatment of the word "neoplasms" in the following context:
Although we do not tag "neoplasm" by itself, it is tagged as malignancy-type when modified, just like "tumor":
- hematolymphoid (+) neoplasms
- adrenal (+) neoplasms
- ...
- Small round cell neoplasms
Not Malignancy-type
- "metastasis", "XYZ metastasis"
Not a Malignancy-type. But metastasis can be a Malignancy-clinical-stage.
- Types of normal tissue
E.g., "fibroblast" is not a Malignancy-type; tag it as Malignancy-histology.
- Premalignant Conditions
Tag these as Clinical-stage:
- MEN type 2B syndrome
- dysplasia
- myeloproliferative disorders (syndrome)
"Type 1 and Type 2"
An annotator asked about tagging "type 1" and "type 2" in the following phrase:
These references should not be tagged. They are a distinction made by the authors describing the affected site, and are not clinical names of malignancies.
Genes Named After Malignancies
An annotator asked about the treatment of a gene that is named after a malignancy:
"Ewing's sarcoma gene" should be tagged only as a gene. Although "Ewing's sarcoma" by itself is certainly a Malignancy-type, the longest or most encompassing tag always takes precedence. The exceptions to this rule are the malignancy-attribute entities (clinical stage, histology, site, hereditary status, and differentiation) that should be tagged when found within a phrase tagged as malignancy-type. See below for more details on how to treat these malignancy attributes.
Specificity
In general, do not apply the Malignancy-type tag to mentions of tumor masses that do not actually specify the type of tumor. "Metastasis" is a particular case of this general rule.
Malignancy Clinical Stage
We use this attribute for two distinct types of mention:
- Description of the tumor on a scale, whether formally structured ("Stage 1", "Stage A") or informal ("lower stage").
- Names of premalignant conditions.
Staging Systems
Tumors are usually staged clinically by researchers. This attribute is used to evaluate the extent of a cancer within the body, especially whether the disease has spread from the original site to other parts of the body. There are different staging systems for different kinds of tumors.
There are three staging systems used for neuroblastoma: the Evans System, the St. Jude System, and the International Staging System. We may see any of these; any of these would be tagged as Clinical Stage. Other systems are used for other kinds of cancer. Tag them all, not just for neuroblastoma.
The International Neuroblastoma Staging System (INSS) is now universally used to stage neuroblastoma:
- Stage 1: Localized tumor confined to the area of origin, with complete gross excision, lymph nodes microscopically negative.
- Stage 2: The tumor extends beyond the structure of origin, but does not cross the midline, with (2B) or without (2A) ipsilateral lymph node involvement.
- Stage 3: Tumor extends beyond the midline, with or without bilateral lymph node involvement.
- Stage 4/4S: Tumor disseminated to distant sites, such as bone, bone marrow, liver, skin or lymph nodes.
The older Evans system for neuroblastoma:
- A (=INSS 1)
- B (2; 2a, 2b)
- C (3)
- D (4)
- DS (4S)
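For quick reference, the Evans-to-INSS correspondence listed above can be written as a lookup table. This is a sketch only; the names `EVANS_TO_INSS` and `inss_equivalents` are hypothetical, and both systems are tagged as Clinical Stage regardless of the mapping.

```python
# Evans stage -> equivalent INSS stage(s), as listed in the guideline.
EVANS_TO_INSS = {
    "A": ["1"],
    "B": ["2", "2A", "2B"],
    "C": ["3"],
    "D": ["4"],
    "DS": ["4S"],
}


def inss_equivalents(evans_stage):
    """Return the INSS stage(s) for an Evans stage, or None if unknown."""
    return EVANS_TO_INSS.get(evans_stage.strip().upper())


print(inss_equivalents("DS"))
```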
Besides the specific terms used in specified staging systems, some general terms can also be used to state the clinical stage of the tumor, and so should be treated as the values of this attribute as well, such as
- advanced stage
- lower stage
- high stage
"Acute/Chronic"
These modifiers are not tagged as Clinical-stage or any other entity.
Pre-malignant Conditions
Q: Should "MEN type 2B syndrome" be tagged as a malignancy? [PMID 9718653] I ask because the definition for it (found in the NCI Metathesaurus) stated that it is characterized by the 100% incidence of medullary thyroid carcinoma.
A: The domain experts decided that all premalignant conditions should be tagged as Clinical-stage, restricting Malignancy-type to "established cancer names". At some point in the future we may develop a separate way of annotating references to premalignant conditions, but this will do for now.
Pre-malignant Diseases in Clinical Stage Boundary Issues
Only those diseases which can potentially develop into malignancies are treated as the values of the malignancy attribute Clinical Stage.
Note: It is not always easy to distinguish between diseases associated with malignancy and diseases which have the potential to develop into malignancy. When there is no clear indication in the texts, you should either use the NCI metathesaurus browser or send it to the email list.
Examples of pre-malignant diseases:
benign tumors
neoplasia
dysplasia
polyp
Malignancy Developmental State
See Developmental State.
Malignancy Histology
This attribute specifies cell and/or tissue type(s) affected by benign or malignant tumors. It includes nothing below the cell level (subcellular components such as "nucleus" or "prokaryon", which we do not tag) and nothing above the tissue level (body structures such as "eye" or body regions such as "arm", both of which are Malignancy-site).
Decisions on Histology vs. Site Annotation Disambiguation
Histology vs. Site Boundary Issues
a. Histology covers groups/types of cells or tissues. It could be the combination of different types of cells or tissues.
Note: This broadens this attribute significantly from our previous usage. As long as the texts talk about the cells or tissues clustered by some standards, they are attribute values.
Examples:
cardiac cells
cardiac tissues
nerve tissues
neuroblastoma cells
b. When Site is used for Histology (e.g., "nerve tissues") tag it only as Histology. See Histology or Site?
c. Site has different levels of specificity. As long as it specifies a body part between the tissue level and the individual level, it is the attribute value.
Examples:
dorsal nervous system liver
The terms are the same terms that are used for healthy cells. For example:
Here "glial cells" is the phrase specifying the cell type making up the tumor, and so will be tagged as Malignancy-histology. The tag should include the word "cells".
This attribute is also commonly used in naming the tumor, so Histology strings often appear as part or all of a Malignancy-type. Since Malignancy-type strings can include tagged strings of other types, such Malignancy-types will have (at least) two tags: the whole string tagged as Malignancy-type, the histological description tagged as Histology, and possibly other descriptors such as Site or Developmental-state.
In the following examples, we would tag the complete string to the left of the dash as Malignancy-type and the underlined part as Malignancy-histology, whether it is just part of the Malignancy-type or all of it. Where the histological description consists of more than one word, tag them as a single string inside the longer Malignancy-type string, not two separate strings (see last two examples).
adenoma -- epithelial cells/tissue of glandular origin (benign tumor)
carcinoma -- epithelial cells/tissue
glioma -- glial cells
leukemia -- blood-forming tissue
lymphoma -- lymphocytes
melanoma -- melanocytes
neuroblastoma -- neuroblasts
retinoblastoma -- retinoblasts
sarcoma -- connective or supportive tissue
squamous cell carcinoma -- squamous cells/epithelial tissue
chronic myelogenous leukemia -- myeloid cells/blood forming tissue
("chronic" is part of the disease name -- Malignancy-type --
but not part of the histological description)
acute lymphoblastic leukemia -- lymphoblasts/blood forming tissue
We will tag all references to cell type as Malignancy-histology, whether or not they actually are in a description of a malignancy. Even if, for example, "epithelial cells" appear in a sentence also mentioning "adenoma", both terms should be tagged as Malignancy-histology. (See discussion under Malignancy-site.)
Not Histology
The following phrases are not considered histology because the type of tumor or cell is not specified.
- tumor cells
- cancer cells
- cell lines
Tissues that are actually composed of multiple cell types should be tagged as site, not histology; e.g.,
- thyroid tissue
- lung tissue
Histology Adjectives
(A list to be added to.) Tag as Malignancy-histology:
neuronal -- refers to neurons
vascular
capsular
Malignancy Site
This attribute specifies the body part(s) affected by a malignancy, including organs, parts of organs, and body systems as well as terms like "leg" and "elbow" that refer to sections of the body. Terms referring to type of tissue should be tagged as Malignancy-histology.
Like Malignancy-histology, Malignancy-site is frequently used for naming Malignancy-type; in fact, all the body parts mentioned in the tumor names are the sites of the (not necessarily primary) tumors, and so are tagged with this attribute. Examples of this kind include the following (attribute values are underlined):
lung cancer
cancer of head and neck
colon cancer
brain tumor
Tag body part names in references to metastases. Although metastasis references are not Malignancy-type, we are tagging body part names wherever they occur:
bone metastasis
Sometimes a part of the body may be referred to with the word "area" or "region". It may be redundant, or the authors may be referring to a larger region than just the body part name that modifies it. Don't try to guess or figure it out or look it up, but just include it in the tagged string. But if "area" (or similar word) is accompanied by an identifier, the phrase probably refers to a very specific section of the body part mentioned or being discussed, so include the identifier as well.
lesions in the patellar region
Conflicts in Tagging Malignancy-site
(See below for terms that refer simultaneously to a cell or tissue type and to an organ or system of the body.)
Multiple body parts may be mentioned in conjunction, possibly referring either to a single value or to different values depending upon the context. For example, one abstract may always speak of "tumors of the head and neck", while another abstract may start off discussing "tumors of the head and neck" and later go on to separate discussions of "tumors of the head" and "tumors of the neck". Rather than read the whole abstract to decide whether such a conjoined mention at the beginning should be treated as one Site or as two, you should tag "head and neck" as a single Malignancy-site. (Actually, there aren't many Sites that are conjoined in this way; maybe the only other set is "small and large intestine".)
In a coordination like "tumors of the head and of the neck", where the second conjunct has its own preposition, tag the Sites separately.
Note that "cancer of the neck" is not a Malignancy-type because of the "no prepositions" rule for that attribute. But in such expressions, do tag "cancer" by itself as Malignancy-type:
cancer of the neck
---- Malignancy-site
------ Malignancy-type
cancer of the head and neck
------------- Malignancy-site
------ Malignancy-type
head and neck cancer
------------- Malignancy-site
-------------------- Malignancy-type *
* The Site ("head and neck") precedes "cancer" and there is no preposition.
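The three cases above can be sketched as a small function. It is a simplified illustration under stated assumptions: the function name, the preposition list, and the idea that the head noun is everything before the first preposition are all hypothetical simplifications, and the Site string is supplied by the caller rather than detected.

```python
# Illustrative preposition list; the guideline only refers to a
# "no prepositions" rule without enumerating them.
PREPOSITIONS = {"of", "in", "on", "at", "from", "within"}


def tag_malignancy_phrase(phrase, site):
    """Return (text, label) pairs for a phrase naming a malignancy and a site."""
    tokens = phrase.split()
    tags = [(site, "Malignancy-site")]
    prep_positions = [
        i for i, tok in enumerate(tokens) if tok.lower() in PREPOSITIONS
    ]
    if prep_positions:
        # Preposition present: only the head before it is Malignancy-type.
        head = " ".join(tokens[: prep_positions[0]])
        tags.append((head, "Malignancy-type"))
    else:
        # No preposition: the whole phrase is Malignancy-type.
        tags.append((phrase, "Malignancy-type"))
    return tags


for phrase, site in [
    ("cancer of the neck", "neck"),
    ("cancer of the head and neck", "head and neck"),
    ("head and neck cancer", "head and neck"),
]:
    print(phrase, "->", tag_malignancy_phrase(phrase, site))
```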
We will tag all references to location in the body as Malignancy-site, whether or not they actually are in a description of a malignancy, and even if they appear to be redundant with another mention in the sentence. For example, if "epithelial cells" appear in a sentence also mentioning "adenoma", both terms should be tagged as Malignancy-histology. In
tag as follows:
- left ankle : Malignancy-site
- osteosarcoma : Malignancy-type and Malignancy-histology
- left tibia : Malignancy-site
We are doing this for several reasons.
First, this consistency will better enable the automatic taggers that will be trained on your work to identify body parts and words referring to them.
Second, it should make it easier for you to do your work consistently with yourselves and with each other.
Third, it will do no harm to tag as Malignancy-site a body part mention which is not involved in a malignancy, just as there is no harm in tagging a mention of a gene or chromosome that is not involved in a variation, because when we mark the relationships between these tagged entities those mentions will not be included in any relation. On the contrary, they will provide useful contrastive data to the algorithms that try to learn how to mark relations: "That body part mention is involved in relationship A, but this body part mention is not involved in any relation, so the difference between their contexts should be an indicator of when a body part is or is not to be considered part of a relation."
Conflict Adjectives
Tag as Malignancy-site:
neural -- refers to the nervous system
Histology or Site?
It is not always immediately clear whether to tag a reference as Histology or as Site.
System: tag as Site. Our domain experts have decided that references to a system of tissues in the body, such as "musculature" or "autonomic [nervous system]", should be tagged as Site rather than Histology. Similarly, adjectives referring to a system (e.g., "neural") should be tagged as Site (compare "neuronal", which is Histology).
Both at once: tag as Histology. A text string may refer both to cell or tissue type and to an organ or system of the body. For example, "lymphoma" refers to both lymphocytes (cell type, so Histology) and the lymphatic system (system, therefore Site). In order to avoid double tagging, we will tag such strings only as Malignancy-histology, which carries more specific information than Malignancy-site. The histology implies the site, but not necessarily vice versa.
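The two decisions above reduce to a simple priority: anything that names a cell or tissue type is Histology, even if it also names a system. A minimal sketch follows; the function name and the boolean inputs are hypothetical, and it does not cover the separate rule that tissues composed of multiple cell types (e.g., "lung tissue") are tagged as Site.

```python
def histology_or_site(refers_to_cell_or_tissue, refers_to_system_or_body_part):
    """Pick a single tag for a string, following the priority described above."""
    if refers_to_cell_or_tissue:
        # Covers the "both at once" case: Histology implies the Site.
        return "Malignancy-histology"
    if refers_to_system_or_body_part:
        return "Malignancy-site"
    return None  # neither attribute applies


# "lymphoma": lymphocytes (cell type) and lymphatic system (system)
print(histology_or_site(True, True))
# "neural": refers to the nervous system only
print(histology_or_site(False, True))
```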
Malignancy Differentiation
This attribute shows the degree of tumor cell differentiation. At the early stage of normal development, cells within a particular tissue often look similar in appearance and function, a condition that is described as "undifferentiated". As development proceeds, they often change in appearance, behavior, and/or molecular characteristics, including the ability to evolve into two or more distinct cell subtypes. Many tumor cells, however, don't follow the normal developmental process, but stop differentiating at some point. This attribute indicates where that point is by specifying the degree of tumor cell differentiation.
Differentiation status of a tumor is often described roughly with phrases like
differentiated
well-differentiated
poorly differentiated
highly differentiated
Pathologists also have a number of numerical grade systems to describe the degree of differentiation of tumor cells more precisely, with different systems used for different kinds of tumors. In most such systems, lower grades describe well-differentiated tumors and higher grades poorly differentiated ones. Both the descriptive phrases and the systematic grade levels should be tagged as Malignancy-differentiation.
Malignancy Heredity Status
A malignancy can be partially or fully inheritable or can appear spontaneously without any similar family history. This attribute describes whether the malignancy in discussion has hereditary properties, that is, whether it can be transmitted from parent to child by information contained in the genes. The most common descriptions of this attribute are
familial
hereditary
paternal/maternal
sporadic (which in this context means "not inherited")
NOTE: "congenital" does not refer to Malignancy-heredity-status. A newborn child is a nine-month-old organism, and in that period can have developed a sporadic malignancy unrelated to the parents' germ plasm. For brevity, "status" is omitted on the button in WordFreak, and probably in most of our discussions both oral and written.
Malignancy Survival
Survival-status refers to the most basic issue for all life forms, life or death. For annotation purposes we are splitting it into two components: Survival-status and Survival-status-modifier.
Survival-Status
The core concept of survival status is simply survival vs. death:
survival, survive(d), surviving,...
alive
dead, died,...
Often you will see it modified by some reference to the state of the disease. As long as such a modifier is contiguous to the core survival status, include it in the entity reference. Several of these combinations are so common as to have standard abbreviations:
progression free survival (PFS)
relapse free survival (RFS)
cumulative survival (CS)
overall survival (OS)
(The last two are synonymous, referring to all survivors regardless of disease condition.)
Modifiers that do not refer to the state of the disease should not be included in the Survival-status or Survival-status-modifier:
died of other causes
5-year survival
median survival
Many of them will be tagged in other ways; see below.
The survival-status entities present in the above sentence are the words "survival", "dead", and "alive".
Survival-Status-Modifier
Sometimes a Survival-status has a non-adjacent modifier:
9/11 children are alive, 8 without progression or relapse
"Alive without progression or relapse" would be the Survival-status if it were a continuous string, but it isn't continuous, and the permissibility of chaining here is borderline at best. So in this sentence we would tag "alive" as Survival-status, and "without progression or relapse" as Survival-status-modifier. Use Survival-status-modifier only where the modifier describing state of the disease is not continuous with the core Survival-status term.
However:
Although the word "survival" should be tagged as Survival-status, the phrase "after treatment" should not be tagged at all. Modifiers that do not refer to the state of the disease should not be included in the Survival-status or Survival-status-modifier. (See Survival-status definitions)
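The contiguity test that separates Survival-status from Survival-status-modifier can be sketched as follows. This is an illustration only: the function name, the pre-tokenized input, and the (start, end) token-index pairs are assumptions, and the caller is assumed to have already decided that the modifier refers to the state of the disease.

```python
def tag_survival(tokens, core, modifier):
    """Tag a core survival term and a disease-state modifier.

    core and modifier are (start, end) token index pairs, end exclusive.
    """
    c_start, c_end = core
    m_start, m_end = modifier
    contiguous = m_end == c_start or c_end == m_start
    if contiguous:
        # Adjacent: one Survival-status covering both.
        start, end = min(c_start, m_start), max(c_end, m_end)
        return [(" ".join(tokens[start:end]), "Survival-status")]
    # Not adjacent: tag the two pieces separately.
    return [
        (" ".join(tokens[c_start:c_end]), "Survival-status"),
        (" ".join(tokens[m_start:m_end]), "Survival-status-modifier"),
    ]


sentence = ["9/11", "children", "are", "alive", ",", "8",
            "without", "progression", "or", "relapse"]
print(tag_survival(sentence, core=(3, 4), modifier=(6, 10)))
print(tag_survival(["progression", "free", "survival"], core=(2, 3), modifier=(0, 2)))
```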
Developmental State
We are redefining what used to be Malignancy-developmental-state. As you annotate developmental states under these new definitions, delete the old Malignancy-developmental-state labels.
Developmental-state represents different developmental parts of the lifetime of the malignancy's host (in the sense that a parasite lives in a host):
- an individual (usually a patient or patients)
- a cell line
- a tissue
Separating quantitative expressions from the entities they refer to creates a potential problem. Under the older definitions, "young" and "old" (in "old women") and "five years old" would all be tagged the same way, as Malignancy-developmental-state. In the new definitions "five years" is Time and "old" is Quantitative-classifier. If we continue to label "young" and "old (women)" as Malignancy-developmental-state, we run the risk of confusing references tagged under the old definition with references tagged under the new one.
To avoid this problem, we'll use the label "Developmental-state", without "Malignancy", for the new definition, reserving it for words like "young", "infant/infancy", and "old" when it refers to being old rather than to the dimension of age. Here is a list of such words, borrowed from the earlier definition of Malignancy-developmental-state and expanded. We will probably find a few more such words, but we don't expect the list to expand without end.
Development of person or tissue:
embryo(s), embryonic
fetus(es), fetal
infant(s), infancy, infantile
young, youth
pediatric
child, children
boy(s), girl(s)
juvenile(s)
adult(s)
old (As in "an old man", not "68 years old"; see Quantitative-classifier)
cognitive functions in old men
old age, old-age
senile (Only when meaning strictly 'relating to old age', as in "senile dementia")
Development of cell line:
passage 23 (also see What's not a Count?)
early passage
If one of these terms is accompanied by a modifier of degree, the modifier should be included in the tag as well:
very young
early childhood
extreme old age
in very old men and women
mortality in the old-old
Similarly, some comparative words can also be values of this attribute:
older, oldest
younger, youngest
all were older than one year of age
In level 1 tagging, "older than one year of age" would have been tagged Malignancy-developmental-state. In level 2 tagging, "older than one year" is a Time and "age" is a Quantitative-Classifier.
Quantities
We annotate six categories of quantitative entity:
Before defining them individually, we'll discuss the kinds of parts that you may find in quantitative references.
What are Quantity Strings Made of?
No matter which category a quantity reference fits into, the tagged string will include one or more of the following:
- some kind of amount, which can be either
- a single quantity
- or a range
- a unit of measurement
- a margin of error
- a qualifier of imprecision
- a limit specifier
- an arithmetical comparison (only in Proportions)
This is not a list of quantitative entity types to be tagged separately, but of possible elements of a single quantity string. In the extended examples, the element of the type under discussion is in boldface. Comments in example boxes are in italics, as are many strings that might be mistaken for the type under discussion.
Quantity
This is usually numeric, whether written in digits or in words:
5, five
0.5, half, 50%, 50 percent, fifty percent
10(-6) (= 10^-6)
Nonnumeric Quantities
In some cases the quantity can also be a nonnumeric expression, but only if the text can be interpreted mathematically in terms of one of the categories below. The following are all Proportions:
all metaphases (= "100% (of)")
both groups (= "100%, n = 2")
none of 16 ganglioneuromas (= "0%")
most tumors in group 3 (= "> 50%")
Non-numeric and Not Quantities
In contrast, the following do not qualify as quantity strings; we can't do mathematics with them.
Here there's no way of telling how many are enough to be "many".
found in many tumors
Many Americans like sushi, but probably under 25%.
Many Penn students eat fast food at least three days a week
The last number probably exceeds 50% of Penn students, but it's certainly smaller than the number of Americans who like sushi. So how many is "many"?
in a few hours after admission [PMID 10930802]
"Few" and "a few" fail to work, the same way as "many".
We're not prepared to annotate comparisons that don't have a quantitative element. "Longer than" is like ">" for Time, but there's no quantity on the right hand side here, so we can't interpret the text mathematically. Compare "longer than five years", which we would consider a quantity. "older, younger", etc., are likely to be tagged as Developmental-state.
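The distinction between nonnumeric quantities and non-quantities can be summarized as a lookup: a word qualifies only if it has a mathematical reading. The sketch below uses only the words discussed above; the names `NONNUMERIC_PROPORTIONS`, `NOT_QUANTITIES`, and `interpret_quantifier` are hypothetical, and the lists are not meant to be exhaustive.

```python
# Words with a mathematical reading (all are Proportions).
NONNUMERIC_PROPORTIONS = {
    "all": "= 100%",
    "both": "= 100%, n = 2",
    "none": "= 0%",
    "most": "> 50%",
}

# Words with no fixed mathematical reading: do not tag as quantities.
NOT_QUANTITIES = {"many", "few", "a few"}


def interpret_quantifier(word):
    """Return the mathematical reading of a quantifier, or None if it has none."""
    key = word.strip().lower()
    if key in NOT_QUANTITIES:
        return None
    return NONNUMERIC_PROPORTIONS.get(key)


for w in ["all", "most", "many", "a few"]:
    print(w, "->", interpret_quantifier(w))
```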
Range
A range rather than a single quantity. It can be expressed symbolically or verbally.
from 1 to 14 years
- (the whole phrase, referring to age, is Developmental-state)
Sixteen- to 21-year-old patients
- ("Sixteen- to 21-year-old" is Developmental-state)
patients diagnosed from 1989 to 1995 (Time)
patients with de novo AML treated between 1984 and 1986 (Time)
body temperatures from 101 to 104 C
- ("from 101 to 104 C" is Measurement)
concentrations of 400-650 ng/L
- ("400-650 ng/L" is Measurement)
Beware of expressions like the following, where "from... to..." doesn't indicate a range, but the beginning and end points of a change:
increased from 9.04 +/- 0.44 to 12.08 +/- 0.83 [PMID 9515567]
Unit of Measurement
Where applicable (Time and Measurement, not Count or Proportion):
a 5-year follow-up period
aged from 1 to 14 years
15 ml
819 muM
apparent molecular mass 31 kD
1.2 mg/L *
180 mg/m2 (i.e., "milligrams per square meter")
0.584 nmol/mg/min
* As far as we're concerned, "milligrams per liter" is as much a unit as just plain "milligrams"; compare "miles per hour", usually written "mph".
Things being counted are not units (see Count):
Two experimental groups
162 cases of neuroblastoma
15 patients
50 ways to leave your lover
Margin of Error
9.04 +/- 0.44
545 +/- 108 pmol/ml
82.6%+/-7.9%
26 +/- 19 months
Qualifier of Imprecision
a prognosis of approximately 10-20%
About one half of aggressive neuroblastomas
Limit Specifier
A limit specifier can be either verbal or symbolic (shown in red because an underlined greater-than sign looks like greater-than-or-equal-to):
reduced PMA induction by >50%
a median follow-up of >/=6.8 years
decreased SH-SY5Y viability to <30%
less than 0.01
patients alive more than 9.5 months after diagnosis
survival time longer than five years
above 8 years old
high Ki-67 scores (> or = 50%) [PMID 11237496]
But not a bare equal sign, which doesn't modify the quantity: there is no limit specifier in
P = .001
Remember, none of the above elements except a single quantity or a range can ever be a whole quantity string by itself, and even a quantity or a range is often accompanied by one or more of the other elements. But any of them may be included in the string, subject to the limitations described below.
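One way to picture this constraint is as a record whose only required part is the amount. The sketch below is an illustration, not a tagging tool: the class name `QuantityString`, its field names, and the method `is_taggable` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantityString:
    """Possible elements of a single quantity string (all optional here)."""
    amount: Optional[str] = None           # single quantity, e.g. "15"
    range_: Optional[str] = None           # range, e.g. "400-650"
    unit: Optional[str] = None             # e.g. "ml", "ng/L"
    margin_of_error: Optional[str] = None  # e.g. "+/- 0.44"
    imprecision: Optional[str] = None      # e.g. "approximately"
    limit: Optional[str] = None            # e.g. ">", "less than"
    comparison: Optional[str] = None       # arithmetical comparison (Proportions only)

    def is_taggable(self):
        # Only a single quantity or a range can anchor a quantity string;
        # the other elements may accompany it but never stand alone.
        return self.amount is not None or self.range_ is not None


print(QuantityString(amount="15", unit="ml").is_taggable())
print(QuantityString(unit="ml").is_taggable())
print(QuantityString(range_="400-650", unit="ng/L").is_taggable())
```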
Impostors
Some more things that look like quantity expression elements but aren't: