Relation Annotation Pilot Study: Variation Relations




What Is This About?

Tagging Relations

Up till now we have been able to tag many types of term, but not the relations between them. For a long time the senior personnel have been considering how to annotate such relationships. Now we have a relationship tool in WordFreak. During most of October the CYP annotators were alpha-testing it, and now we are ready to put it to use in a pilot project.

Example

When you annotate a text like

We report a case of colon cancer presenting point mutations at both codons 12 and 22 of the K-ras gene. PCR-SSCP and subsequent sequencing revealed that GGT (glycine, wild-type) to AGT (serine) substitution at codon 12 and CAG (glutamine, wild-type) to CGG (arginine) substitution at codon 22 occurred in the same allele. (PMID: 12110640)
you tag it something like this:

colon cancer      Malignancy-type
colon             Malignancy-site
point mutations   Variation-type
codons 12         Variation-location
codons (+) 22     Variation-location
K-ras             Gene-gene/RNA
GGT               Variation-state-original
glycine           Variation-state-original
wild-type         Variation-type
AGT               Variation-state-altered
serine            Variation-state-altered
substitution      Variation-type
codon 12          Variation-location
CAG               Variation-state-original
glutamine         Variation-state-original
wild-type         Variation-type
CGG               Variation-state-altered
arginine          Variation-state-altered
substitution      Variation-type
codon 22          Variation-location

But we have had no way to annotate the fact that "GGT", "AGT", "substitution", and "codon 12" are part of the same variation event, or to say that "glycine" and "GGT" describe the same event at different levels of specificity, as does "wild-type" in yet another way. Nor do we have any way to associate them with the "K-ras" gene or the malignancy "colon cancer". All these associations are essential to organizing these entities in a database in a way that will make them useful to researchers. The relation tool is intended to fill this gap.

We can think of a variation event as comprising, at its fullest, a type, a location, an original state, and an altered state. For instance:

This is the conceptual structure that WordFreak's new relation tool is designed to implement.
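As a rough illustration only, the four-part structure can be pictured as a record in which every slot is optional. This is not how WordFreak stores relations internally; the class name and field names below are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariationRelation:
    """One variation event. Each slot holds at most one entity reference.

    Hypothetical structure for illustration; not WordFreak's own data model.
    """
    var_type: Optional[str] = None
    location: Optional[str] = None
    state_original: Optional[str] = None
    state_altered: Optional[str] = None

    def filled_slots(self) -> dict:
        """Return only the slots that have been filled in."""
        return {name: value for name, value in vars(self).items() if value is not None}


# The nucleotide-level description of the codon 12 event in the example text.
codon12 = VariationRelation(
    var_type="substitution",
    location="codon 12",
    state_original="GGT",
    state_altered="AGT",
)

# A relation does not have to be "at its fullest": this one has only two slots.
partial = VariationRelation(var_type="point mutations", location="codons 12")
```

The one-entity-per-slot limit is what the sketch is meant to show: the amino acid description of the same event (glycine, serine) cannot go into `codon12` and needs a record of its own.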
[In this pilot project we are not using Variation-state-generic or Variation-event, although the tool has room for them. Variation-state-generic by definition describes a reference which is not clearly to either an original or an altered state, so you won't be putting any of these into a relation. And Variation-event is something that we (senior personnel) still don't have a clear conceptual grip on, so don't put that in either.

I can imagine a case in which a Variation-state-generic might be significant, something like
"We found that two alterations had occurred in codon 47: C->G->A, resulting in Arg->Gly->Ser"
where the middle term of each pair of transformations is the altered state of the first and the original state of the second. I don't know if the relation tool as we currently have it can even handle such cases, and in any case they are very infrequent.]

"The Big Grant"

Penn has been invited to reapply for a big grant for creating a bioinformatics center, and the people concerned here are quite eager to succeed. One of the four or five divisions of the Penn submission is to be a proof of concept of automated relationship tagging in oncology. Scott Winters is coordinating this part of the submission. There is plenty of money and resources available for the submission, which is just as well, because it is due January 1.

The overall plan for the submission is:

  1. Get the relation tool working for oncology, with just the minimal set of features required for this purpose (Eric)
    [DONE]

  2. Add relationship annotations [in progress]:

    1. Take oncology files that we have already annotated (I and III in the diagram, but all treated as a single set)
      (Originally we were going to apply this pilot to all the neuroblastoma files that had been checked in to Onco Entity C pass 1 at the time we set up the workflow. We knew that the Sanger files were richer in variations, but the neuroblastoma files had been annotated, in general, more thoroughly and consistently. But the annotators working on the files told us that there were very few variation relations to be found in them, and on November 4 we decided to modify the mix by

      1. adding files from the Sanger corpus, concentrating on files from that corpus that have at least two different types of variation entities
      2. taking out the neuroblastoma files (that have not already been checked out) that don't have at least two different types of variation entities)

    2. Annotate relations between variation subentities in those files (transform I and III into II and IV)
      (All oncology annotators, plus senior annotation personnel Ramez Zakhary, Yang Jin, and Mark Mandel. [2004-11-08])

  3. Build and train an automatic Variation relation tagger (Ryan and Scott)

    1. Divide the files into a training set and a test set (draw the horizontal dashed line separating I from III, and II from IV)
    2. Use one group of files, with their manually added relations (II), to train the automatic tagger
    3. Take the other group of files, without relations (III), and run it through the tagger to add relations (output = V)
    4. Score the tagger's relations against the relations manually added (V vs. IV)

  4. Write it up and submit it by New Year's Day (senior personnel)
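Step 3.4 (scoring V against IV) could be sketched as follows. The plan above does not say which measure will be used; precision, recall and F1 over exactly matching relations are an assumption made here for illustration, and the relation tuples are made-up examples rather than real annotation data.

```python
def score_relations(predicted, gold):
    """Compare tagger output with manual relations; return (precision, recall, f1).

    A relation counts as correct only if it matches a manual relation exactly.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical relations, each written as (type, location, original, altered).
manual_iv = {
    ("substitution", "codon 12", "GGT", "AGT"),
    ("substitution", "codon 22", "CAG", "CGG"),
}
tagger_v = {
    ("substitution", "codon 12", "GGT", "AGT"),
    ("substitution", "codon 22", "CAG", "AGT"),  # wrong altered state
}

precision, recall, f1 = score_relations(tagger_v, manual_iv)
```

Exact matching is the strictest choice; a scheme giving partial credit for relations with some correct slots would be equally defensible.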

We're going to add relations to the neuroblastoma files, just for Variation -- never mind the many attributes of Malignancy. We're going to need to do it fast, by about November 15, so Ryan can have the data to work with in time. Fortunately, we don't have to do it perfectly, just well enough to show that it can be done. And until the submission is handed in, or at least past the parts that require annotator involvement, this will be the top priority in oncology of everyone working on it: you, me, Yang, and Scott, at least.

The Relation Tool

The Development Version of WordFreak

The tool is in a special development version of WordFreak, which you can launch from a separate web site:
    http://www.cis.upenn.edu/~lacivita/projects/biologydev/launch.php
You can safely install shortcuts for it without compromising the production version of WordFreak that you already have; this version is called "Wordfreak_v2_DEV".

Accessing the Relation Schema*

When you click ADD to add a file, this version of WordFreak will start by listing in the "Open" window
    Files of Type: Text Columns (*.tc, *.td, *.csv)
which are meaningless for us. Click on it to open the list of file types, and in that list click on Text Files or WordFreak Files, and then select and LOAD your file as usual.

Besides the familiar schemas* for Paragraph, Sentence, Oncology, CYP450, and so on, this version contains Annotations called Onco Relation and CYP450 Relation. In the Relation Annotation schemas, the Chooser window looks almost exactly like the familiar entity Chooser window, but it has an extra pane at the bottom labeled Relations. You can detach this pane by dragging its title bar. You can resize and reshape the relation window and the chooser window, and you may want to make the chooser window as narrow as you can without making the buttons unreadable, in order to free up screen real estate for the relation window.

*"Schema" is Javaspeak for the choices you make from the Annotation menu.

Each relation tree has a tiny icon at the left like a stylized key, which functions like the "+" and "-" icons in a Windows Explorer directory view: click on it to collapse the relation tree to just its title, and again to expand it.

Creating a Relation Structure

When you click on a tagged entity reference in the main window, such as "point mutations" at span 606..621, the button for its label (in this case, Type) turns purple just as in entity annotation. But if you then click on the Type button in the Chooser window, a tree with buttons of its own appears in the relation pane:

Now if you click on "codons 12" just after that, at 630..639, highlighting it, and then click on the Location button in the relation tree [important!], the placeholder "<Location>" will be replaced with "codons 12" and a yellow highlighter background will appear on "point mutations" in the text window. The text and relation tree now look like this:

Then you can click on "point mutations" in the text again to start a new relation tree for the chain "codons (+) 22" (630..636), and so on.

To delete a relation tree, press control-D. This will remove the relation without affecting the entity annotations in it. (But be careful of this if you have linked relations!)

Linking Relation Structures

The variation at codon 12 and the one at codon 22 are separate events, even though they share a Location reference. The next variation event mentioned is more complex, describing the states at two levels of specificity, both as nucleotide sequences in the genome and as amino acids produced:

"GGT (glycine, wild-type) to AGT (serine) substitution at codon 12"

(Forget about "wild-type" for now. We don't currently have a way to handle it, and in this variation event it's redundant, or nearly so.)

But the relations created by WordFreak have only one slot for each kind of entity, and there is no way to put both descriptions into a single relation. Instead, we create enough relations to contain all the descriptions and connect them together. We call this operation "linking" (not to be confused with chaining; any ideas for less confusing nomenclature will be seriously considered).

WordFreak has no conception of "levels of specificity" and will let you fill a slot in our relation tree with any entity reference that has the matching label. It's up to you to group the entities appropriately into relation trees, keeping nucleotide States with nucleotide States (and, where relevant, nucleotide Locations like "bp 147"), and likewise for amino acid States and Locations. (Not all linked sets of relations require this.)
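The grouping rule can be pictured with a very crude check like the one below. It is purely an illustration of what "keeping nucleotide States with nucleotide States" means: the function names are invented, nothing like this exists in the tool, and the test is far too simple for real text (it ignores three-letter codes such as "Gly", and any string made up only of the letters A, C, G, T, U counts as a nucleotide sequence).

```python
NUCLEOTIDE_LETTERS = set("ACGTU")
AMINO_ACID_NAMES = {
    "alanine", "arginine", "asparagine", "aspartic acid", "cysteine",
    "glutamic acid", "glutamine", "glycine", "histidine", "isoleucine",
    "leucine", "lysine", "methionine", "phenylalanine", "proline",
    "serine", "threonine", "tryptophan", "tyrosine", "valine",
}


def state_level(text):
    """Guess whether a State reference is at the nucleotide or amino acid level."""
    cleaned = text.strip()
    if cleaned and set(cleaned.upper()) <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    if cleaned.lower() in AMINO_ACID_NAMES:
        return "amino acid"
    return "unknown"


def same_level(original, altered):
    """True if both States are recognized and belong to the same level."""
    levels = {state_level(original), state_level(altered)}
    return len(levels) == 1 and "unknown" not in levels
```

With this check, "GGT" and "AGT" belong together in one relation tree, as do "glycine" and "serine", while "GGT" with "serine" is the kind of mixture to avoid.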

You link relation trees after you have created them both. On the first line of each relation tree, between the tiny "key" icon and the text of the entity reference on which the tree was built, is a box with an icon like a couple of links of chain. While an entity in one of the trees is selected, click on the "link" icon of the other tree. Both trees' link icons will turn purple, showing that these trees are linked.

If you realize that you have linked two relations that you didn't want to link, no problem. Just select one of them and click on the other one's link icon and they will be unlinked.
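The link icon therefore behaves like a toggle. A minimal sketch of that behavior, with invented class and method names and no claim about how the tool implements it:

```python
class RelationTree:
    """Stand-in for one relation tree; hypothetical, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.linked = set()  # other trees describing the same event

    def toggle_link(self, other):
        """Link the two trees if they are unlinked, unlink them otherwise.

        Returns True if the trees are linked after the call.
        """
        if other in self.linked:
            self.linked.discard(other)
            other.linked.discard(self)
        else:
            self.linked.add(other)
            other.linked.add(self)
        return other in self.linked


nucleotide = RelationTree("GGT to AGT substitution at codon 12")
amino_acid = RelationTree("glycine to serine substitution at codon 12")

first_click = nucleotide.toggle_link(amino_acid)   # links the two trees
second_click = nucleotide.toggle_link(amino_acid)  # unlinks them again
```

The link is recorded on both trees, which mirrors the way both link icons turn purple.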

Unrelated Entities

We're not building relations on everything, not nearly!

Variation only

To begin with, this pilot project is concerned only with the Variation entity subtypes. For this purpose ignore all Gene and Malignancy entities.

Don't go too far away

We are looking for succinct descriptions. Usually the entities that describe a single relation will be found pretty close together in the text. They may or may not be in the same sentence, and you don't have to worry about that. So if you see something like "We observed three distinct substitutions. Glycine was replaced by serine in codon 38...", don't feel shy about putting "substitutions", the two states, and the location into a single relation.

Relations can even connect text in the title with text in the abstract body, as in source file 1174 (PMID: 8514604):

The variation type "point mutations", which refers to the variation events at these locations, is not mentioned anywhere in the abstract body and would be lost if it weren't related with them. But don't go looking all over the abstract to find what might or might not go in there.

No coreference *

We have had several different policies about coreference: expressions like "these mutations" or "this codon" that refer to entities described elsewhere in the text. Sometimes they may be tagged as entities and sometimes not. Even if a coreferential phrase is tagged as an entity, do not incorporate it into a relation. Coreference is an overall issue that we are leaving for some future date.

No synonyms [2004-11-11]

Sometimes we see the same entity referred to in a single variation event in two synonymous ways:

The word "translocation" and the notation letter "t" are synonymous and refer to the same entity in the same event; the same goes for "loss of heterozygosity" and "LOH". There is nothing to be gained by annotating both of them for the event (by means of a linked pair of relations).

In the first case, since all the other information about the event is in the notational formula, use the symbol "t" in the relation rather than the prose word "translocation", as shown below. In the second case both synonyms are part of the prose and I see no principled reason to choose one or the other.

(When this notation is used, the word "translocation" is hardly ever present, so most of the time the variation-type will be just the symbol "t". This is another reason to use the symbol in the relation in this case rather than the word.)

Just plain not related

Many Variation entities don't figure in any relation in the text. For example:

"Mutations at codon 12/13 or codon 61 alter GTP-binding or GTPase activity, respectively." (PMID 8936664)

Here we have three Variation-location references, one of them chained ("codon (+) 13"). But we see no Variation-type or -state mentioned in connection with them ("mutations" is a Variation-event). So these Location references will not get into any relation. [But you don't need all four kinds of Variation entity reference to establish a relation; any two or three will do. See the first screenshot example, "point mutations at both codons 12 and 22", which has just Type and Locations.]
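The label side of this rule can be written down as a small check. It covers only the necessary condition (at least two different kinds among Type, Location, original State and altered State); whether the references really describe the same event remains the annotator's judgment. The function name is invented for this sketch.

```python
# The four entity kinds that can fill a slot in a relation for this pilot.
# Variation-event and Variation-state-generic are deliberately left out.
RELATION_SLOTS = {
    "Variation-type",
    "Variation-location",
    "Variation-state-original",
    "Variation-state-altered",
}


def can_form_relation(labels):
    """True if the labels include at least two different relation slot kinds."""
    return len(set(labels) & RELATION_SLOTS) >= 2


# "Mutations at codon 12/13 or codon 61 ...": an event and three locations.
not_related = can_form_relation(
    ["Variation-event", "Variation-location", "Variation-location", "Variation-location"]
)

# "point mutations at both codons 12 and 22": a type plus a location.
related = can_form_relation(["Variation-type", "Variation-location"])
```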

Button, Button, Don't Click the Wrong Button!

Be very much aware of the difference between the buttons in the Relations pane and the buttons in the top part of the Relations Chooser window, which look just like the entity buttons you are used to. Every time you click a Variation entity button in the top part of the Relations Chooser window, you create a new relation tree... and it isn't always easy to get rid of it.

To delete a relation tree, press control-D. This will remove the relation without affecting the entity annotations in it. But if the relation is linked to other relations, all the linked relations will be deleted.

[From alphatest. Is this still the case?] There is a known bug in this version of the relation tool: when you click the "-" button at the top right of the Chooser window, WordFreak deletes not only the currently selected relation tree, but also the entity label on the currently selected string. If you didn't want to delete the entity label and you notice your error right away or soon after, you can use the Undo command at the top of the Edit menu. Otherwise you have to change the Annotation setting to Oncology, retag the string, change the Annotation setting back to Onco Relation, and resume your relation tagging.

Problem Annotations *

Here are some types of entity and text that annotators have had trouble with.

Base pairs

The "X:Y" representation of a nucleotide base pair (G:C, A:T, A:U [only in RNA]) is a single state.

Translocation notation

(adapted from Definitions page) There's a fairly standard notation for translocations; e.g.,

        t(1;15)(p36.3;q24.2)
That is:

We tag one of the locations as "state-original" and the other as "state-altered". It doesn't theoretically matter which is tagged as which, but for consistency's sake we tag the one mentioned first as original. So we would tag this piece of notation as

        t(1;15)(p36.3;q24.2)
like this:

t              var-type
1 (+) p36.3    var-state-orig
15 (+) q24.2   var-state-alt

with each of the two states being a two-part chain.

(This is an exception to the general restriction on chaining. Normally we chain only in coordinate structures, to avoid problems in matching up the treebanking with the entity annotation. I have checked with Ann Bies about this particular kind of chaining within this specific form of notation, and we have agreed that it is acceptable.)
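The tagging convention for this notation can be spelled out mechanically. The sketch below handles only the simple two-chromosome form shown above and is an illustration of the convention, not part of any tool; the function name and the regular expression are invented here.

```python
import re

# t(chromosome1;chromosome2)(band1;band2), two chromosomes only.
TRANSLOCATION = re.compile(
    r"^(t)\(([^;()]+);([^;()]+)\)\(([^;()]+);([^;()]+)\)$"
)


def parse_translocation(notation):
    """Split translocation notation into a type and two chained states.

    The location mentioned first is treated as the original state, following
    the consistency convention described above. Returns None if no match.
    """
    match = TRANSLOCATION.match(notation.strip())
    if match is None:
        return None
    var_type, chrom1, chrom2, band1, band2 = match.groups()
    return {
        "var-type": var_type,
        "var-state-orig": (chrom1, band1),  # chained: chrom1 (+) band1
        "var-state-alt": (chrom2, band2),   # chained: chrom2 (+) band2
    }


parsed = parse_translocation("t(1;15)(p36.3;q24.2)")
```

Each state comes out as a pair because it is a two-part chain: the chromosome number from the first parenthesis and the band from the second.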

Work Flow

The workflow for this pilot project is being handled through the LAW system in the usual way, in a special workflow, "Onco Relations for 1-05". Only the pass 2 oncology entity annotators are working on it.


Present and future

Our plan is to develop this tool to describe several types of relationship:

  1. At the most basic level, to connect references that are closely linked together to form a single conceptual unit, such as the components of a variation event.

  2. At the next level of complexity, a relationship can itself become a component of a larger relationship. For example:
    "ganglioglioma of the right temporo-occipital region in a ten-year-old patient"

  3. At the highest level of structure we would have the relationships that the biomedical researchers are interested in:

    oncology
      • a Variation
      • in a Gene
      • is associated with a Malignancy
    CYP450
      • a Substance
      • inhibits a CYP450 enzyme
      • with a particular measured effectiveness

  4. And in a different dimension we will link relations that are different ways of describing the same event. This oncology expression --
    "G-to-T transversion mutations in the second base of codon 12 (glycine --> valine)"
    -- is currently tagged as follows:

    This single event is described at two levels of specificity, the nucleotide level (*) and the amino acid level (**).

For this pilot study (October-December 2004) we have implemented #1, the basic relationship, and #4, linked descriptions of a single relationship. #2 and #3, the hierarchical relationships, are still to be developed, but do not seem to offer any obstacles in theory or in implementation.


CHANGE NOTES



2004-11-12