http://bioIE.ldc.upenn.edu
Funded by NSF ITR EIA-0205448, Information Technology Research (ITR) program
Please send comments or questions to
Seth Kulick skulick@linc.cis.upenn.edu Institute for Research in Cognitive Science University of Pennsylvania
The .mrg files follow roughly the same format as the .mrg files provided with the Penn Treebank. There are four main differences:
Each tree has a root that is either SENT, SEC, or ORPH. As discussed in the sentence and section documentation, a paragraph consists of sections and sentences, and each tree should correspond to one section or sentence, as indicated by the root of the tree. An ORPH root indicates a section of the file that is not included in any tree. This is an error, of course, and exists only in the cyp files.
Also as discussed in the entity file documentation, sections are chunks of the file that are not of much concern, such as citations and author listings. The tokenization and POS tags in sections are not checked, and the "tree" for such cases is usually just a FRAG. Therefore, if using this corpus to train a parser, we recommend using only the trees with a SENT root node.
Also, in what follows we will sometimes use "section" or "sentence" loosely to refer to either kind of annotation, when the precise annotation does not matter.
Each token includes its span as well as its POS tag, in the format (POS:span token), where the span is a string [X..Y] that corresponds to the span for that token annotation in the wordfreak annotation file. For example (taken from (onco)source_file_609.mrg):
(SENT
(S
(NP-SBJ-2
(NP (DT:[591..594] The) (NN:[596..608] relationship))
(PP (IN:[609..616] between)
(NP
(NP (DT:[617..620] the) (JJ:[621..631] mutational)
(NN:[632..638] status))
(CC:[639..642] and)
(NP
(NP (JJ:[643..655] histological)
(NML-1 (-NONE-:[655..655] *P*)))
(CONJP (RB:[656..658] as) (RB:[659..663] well) (IN:[664..666] as))
(NP (JJ:[668..687] immunohistochemical)
(NML-1 (NNS:[688..696] features)))))))
(VP (VBZ:[697..700] has) (RB:[701..704] not)
(VP (VBN:[705..709] been)
(VP (VBN:[710..718] assessed)
(NP-2 (-NONE-:[718..718] *))
(PP (IN:[719..721] in)
(NP (NN:[722..728] detail))))))
(.:[728..729] .)))
We include the spans on the page to allow users to coordinate with the entity files as desired. A span [X..Y] refers to characters X to (Y-1).
Note that we also include the span even on null elements. This was done to make all terminal nodes of the parse tree consistent, in that they all have associated spans. A null element has a span [X..X], which refers to no characters at all. An example in the tree above is (-NONE-:[655..655] *P*).
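As a minimal sketch (not a tool distributed with the corpus), the terminals described above could be read in Python as follows. The regular expression and the names TERMINAL, terminals, and is_null are assumptions made for illustration. Since a span [X..Y] covers characters X to (Y-1), it corresponds to the Python slice text[X:Y] of the source file.

```python
import re

# Matches a terminal such as "(NN:[596..608] relationship)".
# The tag is everything before ":[", so tags like "." or ":" also match.
TERMINAL = re.compile(r"\(([^\s()]+?):\[(\d+)\.\.(\d+)\] ([^()\s]+)\)")

def terminals(tree_text):
    """Yield (pos, start, end, token) for every terminal in a tree string."""
    for m in TERMINAL.finditer(tree_text):
        yield m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)

def is_null(start, end):
    """A null element has an empty span [X..X]."""
    return start == end

sample = "(NP (JJ:[643..655] histological) (NML-1 (-NONE-:[655..655] *P*)))"
for pos, start, end, token in terminals(sample):
    kind = "null element" if is_null(start, end) else "token"
    print(pos, start, end, token, kind)
```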
A third way in which the trees differ from the PTB is that parentheses and brackets can occur as part of tokens, instead of just as tokens themselves, and spaces can also occur inside tokens (such as in long compound names in the cyp domain). We follow the PTB in treating the open and close parentheses as -LRB- and -RRB- and the square brackets as -LSB- and -RSB-, and also treat them as such when they occur inside tokens. A token-internal space is output as -SP-. For example, the token [1 beta-3H]androstenedione in (cyp)source_file_1021 is output as
(NN:[1892..1918] -LSB-1-SP-beta-3H-RSB-androstenedione)
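To recover the original text of a token, the escapes can be reversed. The following is a small sketch under the assumption that the five codes listed above are the only escapes used; the names ESCAPES and decode_token are hypothetical. A token whose original text literally contained a string such as -SP- would be decoded incorrectly by this approach.

```python
import re

# Escape codes described above, mapped back to the characters they stand for.
ESCAPES = {"LRB": "(", "RRB": ")", "LSB": "[", "RSB": "]", "SP": " "}
ESCAPE_RE = re.compile(r"-(LRB|RRB|LSB|RSB|SP)-")

def decode_token(token):
    """Replace each escape code with its character, scanning left to right."""
    return ESCAPE_RE.sub(lambda m: ESCAPES[m.group(1)], token)

decoded = decode_token("-LSB-1-SP-beta-3H-RSB-androstenedione")
print(decoded)
```

The decoded text has the same length as the span [1892..1918], which is a convenient sanity check when aligning tokens with the source file.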
Quoting from the introduction to the annotation guidelines (Addendum to the Penn Treebank II Style Bracketing Guidelines: BioMedical Treebank Annotation [pdf] [txt]),
"There has been a change in the formatting of indices from the Penn Treebank style. All indices are now on node labels, and index chains can be viewed as equivalence classes, eliminating the need for cascading index chains. Gapping indices are now completely independent of other indices, and are shown with "=" on the node labels of all gapping constituents (both in the template and in the conjuncts)."
As an example of the change in the indices, (cyp)source_file_982.mrg has
(S
(NP-SBJ-1 (DT:[427..428] a) (JJ:[430..440] suprapubic)
(NN:[441..449] catheter))
(VP (VBD:[450..453] had)
(S
(NP-SBJ-1 (-NONE-:[453..453] *))
(VP (TO:[454..456] to)
(VP (VB:[457..459] be)
(VP (VBN:[460..470] introduced)
(NP-1 (-NONE-:[470..470] *))
(ADVP (RB:[471..482] temporarily))
[...]
Consider the index chain consisting of NP-SBJ-1, NP-SBJ-1, and NP-1. The indices for the last two are on the node labels instead of on the empty elements (-NONE- *), and all three are represented as one chain, instead of the two chains that would have been used in the Penn Treebank:
(S (NP-SBJ-1 a suprapubic catheter)
(VP had
(S (NP-SBJ-2 *-1)
(VP to
(VP be
(VP introduced
(NP *-2)
(ADVP temporarily)
(PP-PRP for
(NP urine drainage))))))))
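Because all indices sit on node labels, the members of an index chain in a single tree can be collected by grouping labels on their trailing index. The sketch below assumes that such indices are always written as a final -N on a non-terminal label and that gapping indices (written with "=") are to be ignored; the names LABEL and index_chains are hypothetical, and the sample tree is an abbreviated form of the example above.

```python
import re
from collections import defaultdict

# A non-terminal label ending in "-<digits>", e.g. "(NP-SBJ-1 " or "(NP-1 ".
# Terminals such as "(DT:[427..428] a)" contain ":" and are not matched.
LABEL = re.compile(r"\(([^\s():\[\]]+?)-(\d+)(?=\s)")

def index_chains(tree_text):
    """Map each index to the list of node labels carrying it, in tree order."""
    chains = defaultdict(list)
    for m in LABEL.finditer(tree_text):
        chains[int(m.group(2))].append(m.group(1) + "-" + m.group(2))
    return dict(chains)

sample = """(S (NP-SBJ-1 (DT:[427..428] a) (JJ:[430..440] suprapubic)
 (NN:[441..449] catheter))
 (VP (VBD:[450..453] had)
 (S (NP-SBJ-1 (-NONE-:[453..453] *))
 (VP (TO:[454..456] to) (VP (VB:[457..459] be)
 (VP (VBN:[460..470] introduced) (NP-1 (-NONE-:[470..470] *))))))))"""
print(index_chains(sample))
```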
Each .mrg file has information apart from the trees in lines beginning with a semicolon.
The overall format of each file is as follows, where material in [] may be more than one line.
;PMID
;file name
;Release 0.9 of PennBioIE
;Funded by NSF ITR EIA-0205448
;[section matching information]
;[section matching errors, if any]
;[token matching errors, if any]
(next four parts repeated for each sentence/section)
; sentence/section number and span
; text of sentence/section
; [entities in sentence/section, if any]
TREE
; [sentence/section errors, if any]
For example, (onco)source_file_666.mrg starts with:
;PMID: 9179693
;source_file_666.mrg
;Release 0.9 of PennBioIE
;Funded by NSF ITR EIA-0205448
;0)section:[e:0..32] = [t:0..32]
;1)sentence:[e:37..110] = [t:37..110]
;2)section:[e:114..181] = [t:114..181]
;3)section:[e:185..258] = [t:185..258]
[...]
;section 0 Span:0..32
;Int J Urol 1997 Mar;4(2):178-85
(SEC
(FRAG (NNP:[0..3] Int) (NNP:[4..5] J) (NNP:[6..10] Urol) (CD:[12..16] 1997)
(CC:[17..23] Mar;4-LRB-) (CD:[23..24] 2) (-RRB-:[24..25] -RRB-)
(CD:[25..29] :178) (::[29..30] -) (CD:[30..32] 85)))
;sentence 1 Span:37..110
;Screening of H-ras gene point mutations in 50 cases of bladder carcinoma.
;[50..55]:gene-rna:"H-ras"
;[61..76]:variation-type:"point mutations"
;[92..109]:malignancy:"bladder carcinoma"
(SENT
(NP-HLN
(NP (NN:[37..46] Screening))
(PP (IN:[47..49] of)
(NP
(NML (NN:[50..55] H-ras) (NN:[56..60] gene))
(NN:[61..66] point) (NNS:[67..76] mutations)))
(PP (IN:[77..79] in)
(NP
(NP (CD:[80..82] 50) (NNS:[83..88] cases))
(PP (IN:[89..91] of)
(NP (NN:[92..99] bladder) (NN:[100..109] carcinoma)))))
(.:[109..110] .)))
The line
;1)sentence:[e:37..110] = [t:37..110]
indicates that there is a sentence in the entity file (e:) with span 37..110 that (as expected) matches the sentence in the treebank file (t:) with span 37..110, and so on for the rest of the lines in the section matching information.
The sections and sentences are numbered starting from 0, with each additional entry being either a section or sentence (that is, it is section 0 and then sentence 1, not section 0 and sentence 0).
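A line of the section matching information could be parsed along the following lines. This is a sketch only: the pattern is inferred from the examples in this document, and the names MATCH_LINE, parse_span, and parse_match_line are hypothetical. A span written as null (which occurs in the cyp files) is returned as None.

```python
import re

MATCH_LINE = re.compile(
    r"^;(\d+)\)(section|sentence):"
    r"\[e:(null|\d+\.\.\d+)\] = \[t:(null|\d+\.\.\d+)\]$"
)

def parse_span(text):
    """Turn '37..110' into (37, 110), and 'null' into None."""
    if text == "null":
        return None
    start, end = text.split("..")
    return int(start), int(end)

def parse_match_line(line):
    """Return (number, kind, entity_span, tree_span), or None if no match."""
    m = MATCH_LINE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), m.group(2), parse_span(m.group(3)), parse_span(m.group(4))

print(parse_match_line(";1)sentence:[e:37..110] = [t:37..110]"))
```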
The entry for section 0 shows its span [0..32], its text, and then the tree, which consists of a FRAG under the SEC root. Sections are usually treebanked as just a FRAG with no internal structure, since a section is often just a citation or author listing.
The entry for sentence 1 is more interesting. It has span [37..110] and its text is shown. There are entities included in this tree, and so these are listed before the tree. Each entity is in the form ;span:type:text, as in
;[50..55]:gene-rna:"H-ras"
This information is redundant with the information in the separate entity listing, but we include it here as well for convenience in looking at the entities and trees together.
It is possible that the entity is a chain, in which case it is displayed with each component span and component text listed separately, as in (taken from (onco)source_file_1192.mrg):
;[890..895]...[902..904]:variation-location:"codon"..."13"
which is a chain consisting of the two annotations [890..895]("codon") and [902..904]("13").
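Both the simple and the chained entity lines can be handled by one routine. The sketch below is based only on the two examples shown here; the names SPAN, ENTITY_LINE, and parse_entity_line are hypothetical, and it assumes that the entity text itself never contains the separator sequence "...".

```python
import re

SPAN = re.compile(r"\[(\d+)\.\.(\d+)\]")
# One or more spans joined by "...", then the type, then the quoted text(s).
ENTITY_LINE = re.compile(
    r'^;((?:\[\d+\.\.\d+\])(?:\.\.\.\[\d+\.\.\d+\])*):([^:]+):(".*")$'
)

def parse_entity_line(line):
    """Return (spans, entity_type, texts), or None if the line is not an entity."""
    m = ENTITY_LINE.match(line.strip())
    if m is None:
        return None
    spans = [(int(a), int(b)) for a, b in SPAN.findall(m.group(1))]
    # Drop the outer quotes, then split chain components on the '"..."' separator.
    texts = m.group(3)[1:-1].split('"..."')
    return spans, m.group(2), texts

print(parse_entity_line(';[50..55]:gene-rna:"H-ras"'))
print(parse_entity_line(';[890..895]...[902..904]:variation-location:"codon"..."13"'))
```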
Any known errors associated with this tree are then listed, which in this case are none. A discussion of all the possible errors follows in the next section.
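The overall layout described above suggests a simple way to separate the comment lines from the trees and, following the earlier recommendation, to keep only the SENT-rooted trees for parser training. The following is a rough sketch, not the code used to build the release; split_mrg is a hypothetical name. It relies on the fact that parentheses inside tokens are written as -LRB- and -RRB-, so parentheses in tree lines can simply be counted. The sample is an abbreviated, illustrative fragment.

```python
def split_mrg(text):
    """Return (comments, trees): ';' lines and balanced top-level bracketings."""
    comments, trees, current, depth = [], [], [], 0
    for line in text.splitlines():
        # Comment lines only occur between trees, i.e. at bracket depth 0.
        if depth == 0 and line.startswith(";"):
            comments.append(line)
            continue
        if not line.strip():
            continue
        current.append(line)
        depth += line.count("(") - line.count(")")
        if depth == 0:
            trees.append("\n".join(current))
            current = []
    return comments, trees

sample = """;section 0 Span:0..32
;Int J Urol 1997 Mar;4(2):178-85
(SEC
(FRAG (NNP:[0..3] Int) (NNP:[4..5] J)))
;sentence 1 Span:37..110
(SENT
(NP-HLN
(NP (NN:[37..46] Screening))
(.:[109..110] .)))"""

comments, trees = split_mrg(sample)
sent_trees = [t for t in trees if t.lstrip().startswith("(SENT")]
print(len(comments), len(trees), len(sent_trees))
```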
As discussed above, comments are provided throughout the .mrg files indicating errors in the treebanking or inconsistencies with the corresponding entity file.
The oncology domain treebank files are cleaner than the cyp files, and have no known consistency problems with the entity files. All of the errors in the oncology domain are one of the three listed under Miscellaneous Sentence/Tree Errors. The cyp files have more errors because they have not yet undergone the process of reconciliation with the entity files.
The first two of these errors indicate that the SEC or SENT node has more than one child. The root of each tree is a SEC or SENT, but this root only indicates which kind of sentence-level annotation the tree corresponds to. It should have exactly one child, which is the meaningful root of the syntactic tree; more than one child is an error. For example, from (onco)source_file_1212 (edited for brevity):
;sentence 8 Span:939..1178
;Sixteen of the 42 (38%) moderately differentiated carcinomas, and two of the
;eight (25%) well differentiated carcinomas contained K-ras mutation in codon
;12, but none of the three poorly differentiated carcinomas contained the
;mutation.
(SENT
(S ...)
(,:[1097..1098] ,) (CC:[1099..1102] but)
(S
(NP-SBJ
(NP (NN:[1103..1107] none))
(PP (IN:[1108..1110] of)
(NP (DT:[1111..1114] the) (CD:[1115..1120] three)
(ADJP (RB:[1121..1127] poorly) (VBN:[1129..1143] differentiated))
(NNs:[1144..1154] carcinomas))))
(VP (VBD:[1155..1164] contained)
(NP (DT:[1165..1168] the) (NN:[1169..1177] mutation))))
(.:[1177..1178] .))
;ERROR_Sentence has multiple children (TB Error):[939..1178]::S:,:CC:S:.:
As discussed in the section on the .mrg file format, the sentence is followed by any errors related to it, as is the case here. This error indicates that SENT has more than one child, and that the types of those children are S, ",", CC, S, and ".". The (TB Error) in the error name indicates that the code guesses that this is probably a treebank error. It will sometimes guess instead that it is a sentence annotation error, as in this example from (onco)source_file_2858:
;sentence 5 Span:413..600
;In the pediatric malignancy NB(2), Bcl-2 is highly expressed. In tumors with
;a poor prognosis, N-Myc, a protein homologous to c-Myc, is overexpressed as
;a result of gene amplification.
(SENT
(S
(PP (IN:[413..415] In)
(NP
(NP (DT:[416..419] the) (JJ:[421..430] pediatric)
(NN:[431..441] malignancy))
(NP (NN:[442..447] NB-LRB-2-RRB-))))
(,:[447..448] ,)
(NP-SBJ-1 (NN:[449..454] Bcl-2))
(VP (VBZ:[455..457] is)
(ADVP (RB:[458..464] highly))
(VP (VBN:[465..474] expressed)
(NP-1 (-NONE-:[474..474] *))))
(.:[474..475] .))
(S
(PP-LOC (IN:[476..478] In)
[...]
;ERROR_Sentence has multiple Children (Sentence Error)[413..600]::S:S:
Sentence 5 clearly consists of two actual sentences, each reasonably treebanked with a top node of S. In this case the error will have to be fixed in both the entity and the treebank files.
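The multiple-children condition can be checked directly on a tree by listing the labels of the root's immediate children. The sketch below is an illustration rather than the checker that produced the error comments; TOKEN and root_children are hypothetical names, and the sample is a made-up, heavily abbreviated tree with the same shape as the sentence 8 example.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")

def root_children(tree_text):
    """Return (root_label, labels of the root's immediate children)."""
    root, children, depth, expect_label = None, [], 0, False
    for tok in TOKEN.findall(tree_text):
        if tok == "(":
            depth += 1
            expect_label = True
        elif tok == ")":
            depth -= 1
            expect_label = False
        elif expect_label:
            # For terminals, keep only the POS part of "POS:[X..Y]".
            label = tok.split(":[")[0]
            if depth == 1:
                root = label
            elif depth == 2:
                children.append(label)
            expect_label = False
    return root, children

sample = ("(SENT (S (NP (NN:[1..2] a))) (,:[3..4] ,) (CC:[5..8] but) "
          "(S (NP (NN:[9..10] b))) (.:[10..11] .))")
root, children = root_children(sample)
if len(children) != 1:
    print("multiple children under", root, ":", children)
```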
ERROR_Entity in Section is actually not a treebank error, but rather a potential sentence/section annotation error. It indicates that an entity was found in a section, which is unusual and probably an error, since sections are material of less concern, such as citations and author listings. For example (from (onco)source_file_669.mrg):
;section 9 Span:1639..1885
;These results showed that K- ras mutations are frequent in histologically
;normal cells taken from outside lung adenocarcinomas and suggest that some
;of these mutations may represent early events which could pave the way of
;lung carcinogenesis.
;[1666..1672]:gene-rna:"K- ras"
;[1747..1767]:malignancy:"lung adenocarcinomas"
(SEC
(S
(NP-SBJ (DT:[1639..1644] These) (NNS:[1645..1652] results))
...
;ERROR_Entity in Section[1666..1672]:gene-rna "K- ras"
;ERROR_Entity in Section[1747..1767]:malignancy "lung adenocarcinomas"
This example is indicating that section 9 has two entities in it. And indeed, this clearly should have been annotated as a sentence, not a section, and will have to be fixed in both the entity annotation and the treebank files.
These five errors often go together and so we discuss them as a group. As mentioned above, for the oncology files the sentence and section annotation are identical in the entity and treebank files, but that is not the case for all the cyp files, and so there are mismatches in the sentence matching information. For example, (cyp)source_file_843 starts with:
;PMID: 1872951
;source_file_843.mrg
;Release 0.9 of PennBioIE
;Funded by NSF ITR EIA-0205448
;0)section:[e:0..31] = [t:0..31]
;1)sentence:[e:37..191] = [t:37..191]
;2)section:[e:195..266] = [t:195..266]
;3)section:[e:270..388] = [t:270..388]
;4)sentence:[e:392..706] = [t:392..706]
;5)sentence:[e:708..1096] = [t:708..1096]
;6)sentence:[e:1097..1278] = [t:1097..1278]
;7)sentence:[e:1279..1393] = [t:null]
;8)section:[e:null] = [t:1282..1393]
;9)sentence:[e:1394..1577] = [t:1394..1577]
;10)sentence:[e:1578..1728] = [t:1578..1728]
;11)sentence:[e:1729..1888] = [t:1729..2094]
;12)sentence:[e:1890..2094] = [t:null]
;13)sentence:[e:2095..2253] = [t:2095..2253]
;14)sentence:[e:2254..2421] = [t:2254..2421]
;15)sentence:[e:2422..2581] = [t:2422..2581]
;16)section:[e:2585..2629] = [t:2585..2629]
;Sentence Matching Errors
;ERROR_Different number of sections entity has 16 tree 15
;ERROR_Entity section missing treebank section[1279..1393]
;ERROR_Tree section missing entity section[1282..1393]
;ERROR_Section End Mismatch[e:1729..1888][t:1729..2094]
;ERROR_Entity section missing treebank section[1890..2094]
;ERROR_Entity not in any tree[1279..1297] substance "BP-7,8-dihydrodiol"
;ERROR_Entity not in any tree[1328..1337] quantitative-value "38 to 77%"
;ERROR_Entity not in any tree[1367..1370] quantitative-value "50%"
;ERROR_Entity not in any tree[1958..1964] substance "adduct"
;ERROR_Entity not in any tree[1966..1982] substance "BP diol epoxide"
;ERROR_Entity not in any tree[2015..2029] substance "deoxyguanosine"
;ERROR_Entity not in any tree[2072..2074] substance "PB"
The sentence matching information shows that the entity file has a sentence at 1279..1393 while there is no corresponding sentence in the treebank file, although the treebank file does have a section at 1282..1393. This is a relatively minor case, but entries 11-12 show that the span 1729..2094 is a single sentence in the treebank file while it has been split into two sentences in the entity file.
;ERROR_Different number of sections entity has 16 tree 15
indicates that the entity annotation has 16 sections/sentences while the treebank file has only 15.
;ERROR_Entity section missing treebank section[1279..1393]
indicates that the entity section [1279..1393] has no corresponding treebank section. This error simply flags the line ;7)sentence:[e:1279..1393] = [t:null] in the sentence matching information.
;ERROR_Tree section missing entity section[1282..1393]
indicates that the tree section [1282..1393] has no corresponding entity section. This error simply flags the line ;8)section:[e:null] = [t:1282..1393] in the sentence matching information.
;ERROR_Section End Mismatch[e:1729..1888][t:1729..2094]
Sections/sentences in the entity and treebank files are matched only on the beginning of the annotation. This error flags a matched pair whose ending offsets differ, as they do here for the corresponding sentences. It simply flags the line ;11)sentence:[e:1729..1888] = [t:1729..2094] in the sentence matching information.
;ERROR_Entity not in any tree[1279..1297] substance "BP-7,8-dihydrodiol"
indicates that the entity at [1279..1297] is not contained within any tree. This naturally follows from
;ERROR_Entity section missing treebank section[1279..1393]
and the error is generated for every entity inside that entity section, and likewise for every entity contained in [1890..2094], since that section in the entity file also has no matching tree, as indicated by
;ERROR_Entity section missing treebank section[1890..2094]
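The relationship between the matching lines and the first three kinds of error can be sketched as follows. This is a reconstruction from the description above, not the original checking code; matching_errors is a hypothetical name, and the input is assumed to be a list of (entity span, tree span) pairs already read from the matching lines, with None standing for null. The data are entries 6 to 12 of (cyp)source_file_843.

```python
def matching_errors(entries):
    """Return the mismatches implied by a list of (entity_span, tree_span) pairs."""
    errors = []
    for e_span, t_span in entries:
        if t_span is None:
            errors.append(("entity section missing treebank section", e_span))
        elif e_span is None:
            errors.append(("tree section missing entity section", t_span))
        elif e_span[1] != t_span[1]:
            # Pairs are matched on their start offset, so only the ends can differ.
            errors.append(("section end mismatch", e_span, t_span))
    return errors

entries = [
    ((1097, 1278), (1097, 1278)),
    ((1279, 1393), None),
    (None, (1282, 1393)),
    ((1394, 1577), (1394, 1577)),
    ((1578, 1728), (1578, 1728)),
    ((1729, 1888), (1729, 2094)),
    ((1890, 2094), None),
]
for error in matching_errors(entries):
    print(error)
```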
These errors indicate inconsistencies in tokenization and pos tagging between the entity and treebank files. For example, from (cyp)source_file_1786.mrg:
;Token/POS Errors
;ERROR_Token in entity file but not tree[58..60] NO
;ERROR_Token in entity file but not tree[60..61] -
;ERROR_Token in entity file but not tree[61..69] medicate
;ERROR_Token in tree file but not entity[58..69] NO-medicate
This indicates that the treebank file has a single token at [58..69] for the text NO-medicate and that this token does not exist in the entity file. At the same time, the entity file has three tokens in that span, NO, -, and medicate, none of which are in the treebank file. So the characters in [58..69] have been tokenized differently in the two files.
ERROR_POS mismatch indicates that a token has a different POS tag in the entity and treebank files. It is usually the case that the treebank file has just the default POS tag "token". For example, from (cyp)source_file_630:
;ERROR_POS mismatch [41..43] t:token e:CD 10
indicates that the token at [41..43], with text 10, has the POS tag CD in the entity file but just the default token tag in the treebank file.
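Both kinds of token-level inconsistency amount to comparing the two files span by span. The sketch below assumes that tokens are compared on their exact spans; token_mismatches and pos_mismatches are hypothetical names. The token data come from the (cyp)source_file_1786 example and the POS data from the (cyp)source_file_630 example.

```python
def token_mismatches(entity_tokens, tree_tokens):
    """Tokens are (start, end, text) tuples; return those unique to each side."""
    entity_only = sorted(set(entity_tokens) - set(tree_tokens))
    tree_only = sorted(set(tree_tokens) - set(entity_tokens))
    return entity_only, tree_only

def pos_mismatches(entity_pos, tree_pos):
    """Both arguments map (start, end) to a POS tag; compare the shared spans."""
    shared = sorted(set(entity_pos) & set(tree_pos))
    return [(span, tree_pos[span], entity_pos[span])
            for span in shared if tree_pos[span] != entity_pos[span]]

entity_tokens = [(58, 60, "NO"), (60, 61, "-"), (61, 69, "medicate")]
tree_tokens = [(58, 69, "NO-medicate")]
entity_only, tree_only = token_mismatches(entity_tokens, tree_tokens)
mismatched = pos_mismatches({(41, 43): "CD"}, {(41, 43): "token"})
print(entity_only, tree_only, mismatched)
```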
It is sometimes the case, in error, that there is text in the underlying source file that is not part of any tree in the treebank file for that source file. Before creating each tree representation for the .mrg file, the code checks for and outputs any text that is between the end of the previous tree and the beginning of the new one. It wraps all such text inside an ORPH label. For example, from (cyp)source_file_1010.mrg:
;section 2 Span:125..168
;Falke HE, Degenhart HJ, Abeln GJ, Visser HK
(SEC
(FRAG (NNP:[125..130] Falke) (NNP:[131..133] HE) (,:[133..134] ,)
(NNP:[135..144] Degenhart) (NNP:[145..147] HJ) (,:[147..148] ,)
(NNP:[149..154] Abeln) (NNP:[155..157] GJ) (,:[157..158] ,)
(NNP:[159..165] Visser) (NNP:[166..168] HK)))
(ORPH .)
;sentence 3 Span:173..348
;Corticosterone production by isolated rat adrenal cells in the absence of
;ACTH is stimulated by 20alpha-hydroxycholesterol, 22R-hydroxycholesterol and
; 25-hydroxycholesterol.
There is a period between section 2 and sentence 3 that is not part of
any tree. In this case, it happens to be the ending period from the
citation section 2. This is also listed as an error, and we make the
convention that orphan text is considered an error associated with the
following sentence. So in this example at the end of the tree for
sentence 3 there is:
;ERROR_Orphan Text from Tree File[168..173] .
indicating the text range that contains the orphan text. (The span 168..173 is larger than the range of the period alone, since 168..173 is the complete span between the two trees and includes some whitespace. Of course, there is often whitespace, such as a newline, between two trees, but such material is displayed as orphan text and listed as an error only if there is also non-whitespace material between the trees.)
The intention here is that together all the SENT, SEC, and ORPH trees cover the entire source file, while additional information about the orphan cases is listed as errors.
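The orphan check described above can be sketched as a scan over the gaps between consecutive tree spans, reporting a gap only when it contains non-whitespace. This is an illustration of the idea, not the original code; orphan_gaps is a hypothetical name, and the text and spans below are a toy example rather than corpus data. A gap that runs to the end of the file is reported with None as its end, mirroring the EOF case.

```python
def orphan_gaps(text, tree_spans):
    """Return (start, end, orphan_text) for gaps holding non-whitespace text."""
    gaps, prev_end = [], 0
    for start, end in sorted(tree_spans):
        between = text[prev_end:start]
        if between.strip():
            gaps.append((prev_end, start, between.strip()))
        prev_end = max(prev_end, end)
    # Anything left after the last tree corresponds to the end-of-file case.
    tail = text[prev_end:]
    if tail.strip():
        gaps.append((prev_end, None, tail.strip()))
    return gaps

toy_text = "AB CD.  \nEF GH ]"
toy_spans = [(0, 5), (9, 14)]
print(orphan_gaps(toy_text, toy_spans))
```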
As might be expected, such orphan errors often co-occur with related errors. In (cyp)source_file_1010, there is also the following error listed under Token/POS Errors:
;ERROR_Token in entity file but not tree[168..169] .
which indicates that the period in question was properly tokenized and included in the entity file. Also, the following is listed under Sentence Matching Errors:
;ERROR_Section end mismatch[e:125..169][t:125..168]
indicating that the sections that matched starting at 125 have different ending points.
A special case of the orphan text error is when it occurs at the end of the file. For example, again in source_file_1010, the file ends with:
;section 10 Span:1183..1225
;PMID: 172394 [PubMed - indexed for MEDLINE
(SEC
(FRAG (NNP:[1183..1187] PMID) (::[1187..1188] :) (CD:[1189..1195] 172394)
(-LRB-:[1196..1197] -LSB-) (NNP:[1197..1203] PubMed) (::[1204..1205] -)
(VBN:[1206..1213] indexed) (IN:[1214..1217] for)
(NNP:[1218..1225] MEDLINE)))
;ERROR_Orphan Text from EOF Tree File[1225..EOF] ]
(ORPH -RSB-)
which indicates that the ending text (a right square bracket) was not included in the last tree. For this case, we make the convention that the orphan error is listed with the last sentence/section of the file. There is a slight inconsistency in that, in the comment indicating the error, the missing text is output as is, as "]". In the ORPH pseudo-tree, following the conventions in the rest of the trees, as discussed in the section on Token-Internal Special Characters, it is listed as -RSB-.