This document describes the annotation markup used on the four layers of the Prague Dependency Treebank (PDT) 2.0. PDT 2.0 is distributed in the PML format (see PML Specification). The PML schemas for the four layers can be found in the files wdata_schema.xml, mdata_schema.xml, adata_schema.xml, and tdata_schema.xml in the data/schemas directory on the PDT 2.0 CD-ROM.
For the lower two layers (i.e. the w-layer containing tokens and sentence boundaries and the m-layer containing morphological annotation) we describe all PML constructs defined in the respective PML schemas.
The higher two layers (a-layer and t-layer) are both dependency tree annotations and therefore have a very similar structure: their root elements (adata and tdata, respectively) contain only the required PML header element, followed by the element meta, reserved for meta-data associated with the annotation, and the element trees, which consists of a PML list of the root nodes of the trees. Each node is represented by a PML structure whose members are the node's attributes, except for the member children, which contains a PML list of the node's child nodes. For this reason, Section 3, “Members of nodes in analytical trees (a-layer)” and Section 4, “Members of nodes in tectogrammatical trees (t-layer)” describe only the members of the PML structures used to represent nodes of the respective layer.
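As a rough illustration of the node structure described above, the following sketch models a node as a Python dictionary whose keys are the node's members, with children holding a list of child nodes. The dictionary layout, the sample identifiers, and the helper name iter_nodes are assumptions made for this example; it does not reproduce the PML XML serialization.

```python
from typing import Any, Dict, Iterator

Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Yield a node and then all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


# A tiny hand-made analytical tree (illustrative values only).
tree: Node = {
    "id": "root", "afun": "AuxS", "ord": 0,
    "children": [
        {"id": "n2", "afun": "Pred", "ord": 2,
         "children": [
             {"id": "n1", "afun": "Sb", "ord": 1},
             {"id": "n3", "afun": "Obj", "ord": 3},
         ]},
    ],
}

ids = [n["id"] for n in iter_nodes(tree)]
print(ids)
```

Because every node carries its own children list, a single recursive generator is enough to visit a whole tree on either of the two higher layers.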
wdata
This is the root element of the w-layer annotation. It contains the following elements: meta and doc.
meta
Represents a structure containing meta information related to the w-layer instance.
meta:
lang
original_format
Name of the original format of the annotated data (in PDT 2.0 this is always csts, which is the format used in the Czech National Corpus).
doc
This element delimits a document. Each PDT 2.0 w-layer instance contains exactly one document.
id
required member
source_id
This attribute preserves the identifier of the document used in the original text source. In the case of PDT 2.0 this is the identifier used in the Czech National Corpus.
docmeta
Meta-data related to the document.
para
mdata
This is the root element of an m-layer instance. It contains the following elements:
meta
A structure containing meta-data related to the m-layer annotation.
lang
annotation_info
Value type: list of structures with members described below
Basic information about the source and types of the annotation(s).
annotation_info/id
required member
annotation_info/version_info
This field can be used to specify the version of the tool(s) involved, etc.
annotation_info/desc
A string describing the type of the annotation, e.g. whether the annotation is manual or automatic, what kind of automatic tool was used to create the annotation, etc.
s
id
required member
The value is a unique identifier of the sentence in PDT 2.0.
s
m
Contains a structure that organizes the attributes of the morphological annotation assigned to a unit of the w-layer. If there is more than one annotation (e.g. manual and automatic) for a given item, these annotations can be joined in the element m as alternatives. The origin of each alternative is always marked in its member src.rf. In PDT 2.0 there are no alternatives in the element m; it contains only manual annotation.
id
required member
src.rf
This member indicates the source of the morphological annotation. It contains a reference to the element src in the header of a morphological PML file, where the source of the annotation can be specified. Manual annotation is usually assigned the identifier manual.
w.rf
Value type: ordered list of values of type PML reference
This member contains one or more references to the units of the w-layer, to which the morphological annotation applies.
form_change
Value type: list of values of the following enumerated type: ctcd, spell, insert, num_normalization
The form on the input of the morphological analysis can differ from the token on the w-layer. This member aims at distinguishing the most common cases of such a difference. ctcd marks the cases where the original token is systematically split into two forms (e.g. nač is split into na and co; similarly, in English we could split wanna into want and to). insert marks the cases where the form does not correspond to any token of the w-layer, usually because of a typing error in the text. spell marks the cases in which the form was manually fixed (in comparison to the original token). Most often this happens with mistyped characters, small grammatical errors, etc. num_normalization marks the forms containing canonical forms of numbers derived from tokens by means of automatic normalization (spaces or commas separating thousands, leading zeros in the integer part and trailing zeros in the decimal part are removed, the decimal comma is changed to a decimal point, the + sign is removed, etc.).
form
required member
Contains a word form (not necessarily the basic word form), a number or punctuation. It can differ from the token that actually occurred in the text, e.g. in the case of corrected typing errors (see also the member form_change). We distinguish typing errors (mistakes) from the author's informal style (intention); the latter is not corrected.
lemma
required member
Contains the lemma (basic word form) assigned in the morphological annotation. If necessary, homonymous lemmas with different meanings are differentiated by numerical suffixes. Some lemmas carry additional information about meaning and style shades. In PDT 2.0 the lemma string has the following format (all parts but BaseForm are usually optional):
BaseForm-Number`Reference_:Category_;Term_,Style_^(Comment)
Number
distinguishing homonymous base forms
Reference
pointer to other lemma used mainly for spelled-out numbers or abbreviations for various units
Category
most categories are POS, only Category = B for abbreviations is systematically used, because POS is in the member tag
Term
differentiating named entities (names of persons, geographical names, etc.) or domains of origin of the term
Style
stylistic classification of the lemma
Comment
Explanatory comment of the word meaning (mostly for the homonymous lemmas differentiated by number) or machine readable information about derivation of the lemma from other deeper lemma.
For details see Manual for Morphological Annotation.
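The lemma format above can be taken apart with a regular expression. The sketch below is a minimal illustration that assumes each optional part occurs at most once and in the documented order, and that Reference, Category, Term and Style do not contain an underscore; real data may be less regular, so consult the Manual for Morphological Annotation before relying on it. The function name parse_lemma and the sample strings are constructed for this example and are not quoted from the treebank.

```python
import re
from typing import Dict

# Assumption: parts appear at most once and in the documented order.
LEMMA_RE = re.compile(
    r"^(?P<base>.+?)"               # BaseForm (shortest possible match)
    r"(?:-(?P<number>\d+))?"        # -Number
    r"(?:`(?P<reference>[^_]+))?"   # `Reference
    r"(?:_:(?P<category>[^_]+))?"   # _:Category
    r"(?:_;(?P<term>[^_]+))?"       # _;Term
    r"(?:_,(?P<style>[^_]+))?"      # _,Style
    r"(?:_\^\((?P<comment>.*)\))?"  # _^(Comment)
    r"$"
)


def parse_lemma(lemma: str) -> Dict[str, str]:
    """Split a lemma string into its named parts, omitting absent ones."""
    match = LEMMA_RE.match(lemma)
    if match is None:
        raise ValueError(f"cannot parse lemma: {lemma!r}")
    return {k: v for k, v in match.groupdict().items() if v is not None}


print(parse_lemma("stát-1_^(státní_útvar)"))
print(parse_lemma("km`kilometr_:B"))
```

A lemma that consists of the base form alone is returned as a dictionary with the single key base.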
tag
required member
Contains the positional morphological tag, which encodes the morphological categories relevant for the given word. The pair lemma + tag unambiguously determines the word form. The tag consists of 15 positions; for each position there is a set of allowed characters. The positions correspond to the following categories: 1. part of speech; 2. detailed part of speech; 3. gender; 4. number; 5. case; 6. possessor's gender; 7. possessor's number; 8. person; 9. tense; 10. degree of comparison; 11. negation; 12. voice; 13. reserve; 14. reserve; 15. variant, style. The following parts of speech are distinguished: N - noun; A - adjective; P - pronoun; C - numeral; V - verb; D - adverb; R - preposition; J - conjunction; T - particle; I - interjection; Z - punctuation; X - unknown, unclassifiable. For details see Manual for Morphological Annotation.
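Since every position of the tag has a fixed meaning, a tag can be decoded by pairing each character with the name of its position. The following sketch shows one way to do this; the key names in POSITIONS are labels chosen for this example, and the sample tag (including the use of "-" for positions that do not apply) is an assumption for illustration rather than something defined in this document.

```python
from typing import Dict

# One label per tag position, in order (labels are hypothetical).
POSITIONS = [
    "pos", "detailed_pos", "gender", "number", "case",
    "possessor_gender", "possessor_number", "person", "tense",
    "degree", "negation", "voice", "reserve1", "reserve2", "variant",
]

# Part-of-speech codes as listed in the description of the member tag.
POS_NAMES = {
    "N": "noun", "A": "adjective", "P": "pronoun", "C": "numeral",
    "V": "verb", "D": "adverb", "R": "preposition", "J": "conjunction",
    "T": "particle", "I": "interjection", "Z": "punctuation",
    "X": "unknown, unclassifiable",
}


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each of the 15 tag characters to the name of its position."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


decoded = decode_tag("NNMS1-----A----")
print(POS_NAMES[decoded["pos"]], decoded["case"])
```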
id
required member
s.rf
This member points to a segment of text (a sentence) marked s on the m-layer. The tree is an analytical annotation of this segment of text.
afun
Contains the analytical function AuxS, reserved for the root node of an analytical tree.
ord
required member
Value type: non-negative integer
Specifies the position in the horizontal ordering of the nodes in a tree. All nodes are ordered by their position in the sentence; only the root node has the value 0.
id
required member
m.rf
This member links the units of the analytical layer (analytical tree nodes) to the units of the m-layer. The value of the member is a reference to the element m on the m-layer. While reading the file, an application can remove this member and substitute it with the member m, created from the referenced element of the m-layer.
afun
required member
Value type: enumerated: Pred, Pnom, AuxV, Sb, Obj, Atr, Adv, AtrAdv, AdvAtr, Coord, AtrObj, ObjAtr, AtrAtr, AuxT, AuxR, AuxP, Apos, ExD, AuxC, Atv, AtvV, AuxO, AuxZ, AuxY, AuxG, AuxK, AuxX
The value is an analytical function assigned to a node. It represents the kind of relation of a node to its parent node. Analytical functions are thoroughly described in the Manual for Analytical Annotation.
In PDT 1.0 the value of afun may have included one of the suffixes _Co, _Ap and _Pa. In PDT 2.0 the suffixes _Co and _Ap have been replaced by the member is_member and the suffix _Pa by the member is_parenthesis_root (in all cases with the value 1). The difference between _Co and _Ap is represented by the value of afun of the nearest coordination or apposition node (Coord or Apos) on the path to the root.
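The mapping from PDT 1.0 suffixes to PDT 2.0 members can be sketched as a small conversion function. This is only an illustration of the correspondence described above, assuming that a value carries at most one suffix; the function name split_pdt1_afun is hypothetical and is not part of any PDT tool.

```python
from typing import Dict, Tuple

# Suffix used in PDT 1.0 -> member that replaces it in PDT 2.0.
SUFFIX_TO_MEMBER = {
    "_Co": "is_member",
    "_Ap": "is_member",
    "_Pa": "is_parenthesis_root",
}


def split_pdt1_afun(afun: str) -> Tuple[str, Dict[str, int]]:
    """Return the bare afun and the PDT 2.0 members implied by a suffix."""
    for suffix, member in SUFFIX_TO_MEMBER.items():
        if afun.endswith(suffix):
            return afun[: -len(suffix)], {member: 1}
    return afun, {}


print(split_pdt1_afun("Sb_Co"))
print(split_pdt1_afun("Atr_Pa"))
print(split_pdt1_afun("Pred"))
```

Note that the conversion deliberately does not record whether the suffix was _Co or _Ap: as stated above, that distinction is recovered from the afun of the nearest Coord or Apos node on the path to the root.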
is_member
This member is applicable only to nodes with these properties: the node is a child node of a node with the analytical function Coord or Apos, or it belongs to the subtree of a node with the analytical function Coord or Apos and there are only nodes with the analytical function AuxC or AuxP on the path between the node in question and the root of that subtree. A node with these properties whose member is_member equals 1 is a part of a coordination or apposition structure (it is a member of the coordination or apposition). Nodes with these properties whose is_member is not equal to 1 are common (joint) modifications of the members of the coordination or apposition in whose subtree they appear (as stated above). If the member is not filled, its value is assumed to be 0.
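The applicability condition for is_member on the a-layer can be checked by walking from a node towards the root. The sketch below assumes that the analytical functions of the node's ancestors are available as a list ordered from the parent upwards; the function name is_member_applicable and this input representation are choices made for the example.

```python
from typing import Sequence


def is_member_applicable(ancestor_afuns: Sequence[str]) -> bool:
    """Decide whether is_member applies, given ancestor afuns, parent first.

    The walk skips AuxC and AuxP nodes and succeeds as soon as it reaches
    a Coord or Apos node; any other afun on the way ends the search.
    """
    for afun in ancestor_afuns:
        if afun in ("Coord", "Apos"):
            return True
        if afun not in ("AuxC", "AuxP"):
            return False
    return False


# Direct child of a coordination node.
print(is_member_applicable(["Coord", "Pred", "AuxS"]))
# Separated from the coordination node only by a preposition node.
print(is_member_applicable(["AuxP", "Coord", "AuxS"]))
# An Atr node intervenes, so the member does not apply.
print(is_member_applicable(["Atr", "Coord", "AuxS"]))
```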
is_parenthesis_root
Value 1 identifies roots of subtrees corresponding to parentheses. For historical and technical reasons, a parenthesis root is not marked if it is also a coordination or apposition member. See is_member.
ord
required member
Value type: non-negative integer
This member labels nodes of an analytical tree with non-negative integers representing the surface word order. This is the (left to right) order of representing the nodes in graphical applications.
id
required member
atree.rf
This member binds the t-layer and a-layer via a link to a corresponding analytical tree. See Section 2.1. Relation between the tectogrammatical level and the lower levels in Manual for Tectogrammatical Annotation.
nodetype
This member exists only for the user's convenience. The value is always set to root, which distinguishes the root node of a tree.
deepord
Value type: non-negative integer
Sets the position in the horizontal ordering of nodes in a tree. For the root the value is always set to 0 and, unlike for other nodes, it does not carry any linguistically relevant information.
id
required member
a
Binds the node with units of lower layers. It contains zero, one, or more identifiers of a-layer nodes that influence the members t_lemma, functor, subfunctor, val_frame.rf, or gram on the t-layer. The value is a structure consisting of two parts: lex.rf and aux.rf. See Section 2.1. Relation between the tectogrammatical level and the lower levels in Manual for Tectogrammatical Annotation.
Possible value is a structure with the following members:
a/lex.rf
A link to a node of an analytical tree. Usually it is the node from which the tectogrammatical node has acquired its lexical meaning. For details see Section 2.1. Relation between the tectogrammatical level and the lower levels in Manual for Tectogrammatical Annotation.
a/aux.rf
Value type: list of values of type PML reference
A list of references to nodes on the a-layer. Such nodes most often carry function words (prepositions, subordinating conjunctions, auxiliary words etc.) and they constitute one auto-semantic expression together with the node referenced in the member a/lex.rf. The value of the member a/lex.rf is not part of the list a/aux.rf. For details see Section 2.1. Relation between the tectogrammatical level and the lower levels in Manual for Tectogrammatical Annotation.
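To collect all a-layer nodes that a tectogrammatical node is linked to, the two parts of the member a can be combined. The sketch below assumes a node is held as a Python dictionary in which a maps "lex.rf" to a single reference and "aux.rf" to a list of references; the helper name a_layer_refs and the identifier strings are made up for illustration.

```python
from typing import Any, Dict, List


def a_layer_refs(t_node: Dict[str, Any]) -> List[str]:
    """Return the lex.rf reference (if any) followed by all aux.rf references."""
    a = t_node.get("a", {})
    refs: List[str] = []
    if "lex.rf" in a:
        refs.append(a["lex.rf"])
    refs.extend(a.get("aux.rf", []))
    return refs


# Hypothetical node: one lexical link plus one function-word link.
t_node = {
    "id": "t-node-1",
    "a": {"lex.rf": "a#a-node-3", "aux.rf": ["a#a-node-2"]},
}
print(a_layer_refs(t_node))
# A node with no links to the a-layer yields an empty list.
print(a_layer_refs({"id": "t-node-2"}))
```

Because the value of a/lex.rf is never repeated in a/aux.rf, the combined list contains each referenced node once.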
compl.rf
Value type: list of values of type PML reference
Represents the second dependency of verbal members. It is used for nodes with the functor COMPL. It contains the identifier of the other node in the same tectogrammatical tree that also governs the node in question (that is, a node other than the one to which the edge leads). See Section 6.10. Predicative Complement (Dual Dependency) in Manual for Tectogrammatical Annotation.
coref_text.rf
Value type: list of values of type PML reference
This member records textual coreference. It contains identifiers of nodes in a tectogrammatical tree that refer to the same entity as this node. See Section 9.3.1.1. Explicit co-referred constituent in Manual for Tectogrammatical Annotation.
coref_gram.rf
Value type: list of values of type PML reference
This member records grammatical coreference. It contains identifiers of nodes in a tectogrammatical tree (usually in the same one) which are in the relation of coreference with this node. See Section 9.2. Grammatical coreference in Manual for Tectogrammatical Annotation.
coref_special
Value type: enumerated: segm, exoph
Records special types of textual coreference where the coreferred member is not a single node or subtree of a tectogrammatical tree. Value segm means that the coreferred member is a larger segment of text. Value exoph marks an exophora; the coreferred member is an unspecified off-text situation. See Section 9.3.1.2. Reference to a segment in Manual for Tectogrammatical Annotation and Section 9.3.1.3. Exophora in Manual for Tectogrammatical Annotation.
val_frame.rf
Value type: alternative of values of type PML reference
A reference into the PDT valency lexicon. The value is the identifier of the valency frame realized by the node (and its subtree). See Section 6.2.2. Valency frames and the way they are recorded in the valency lexicon in Manual for Tectogrammatical Annotation.
nodetype
required member
Value type: enumerated: atom, coap, complex, dphr, fphr, list, qcomplex
The value of the member nodetype specifies the type of the node:
atom
coap
complex
dphr
fphr
list
qcomplex
For more about node types, see Chapter 3. Node types in Manual for Tectogrammatical Annotation.
is_generated
Value 1 identifies filled-in nodes. If the member is not present, 0 is implied. See Section 6.12. Ellipses in Manual for Tectogrammatical Annotation.
t_lemma
required member
Contains the node's t-lemma. See Chapter 4. Tectogrammatical lemma (t-lemma) in Manual for Tectogrammatical Annotation.
functor
required member
Value type: alternative of values of the following enumerated type: ACT, AUTH, PAT, ADDR, EFF, ORIG, ACMP, ADVS, AIM, APP, APPS, ATT, BEN, CAUS, CNCS, CM, COMPL, CONJ, COND, CONFR, CONTRA, CONTRD, CPHR, CPR, CRIT, CSQ, DENOM, DIFF, DIR1, DIR2, DIR3, DISJ, DPHR, EXT, FPHR, GRAD, HER, ID, INTF, INTT, LOC, MANN, MAT, MEANS, MOD, OPER, PAR, PARTL, PREC, PRED, REAS, REG, RESL, RESTR, RHEM, RSTR, SUBS, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL, TWHEN, VOCAT
Contains the functor of the node. See Chapter 7. Functors and subfunctors in Manual for Tectogrammatical Annotation.
subfunctor
Value type: enumerated: above, abstr, across, after, agst, along, approx, around, basic, before, begin, behind, below, betw, circ, elsew, end, ext, flow, front, incl, in, less, mid, more, near, opp, target, than, to, wout, wrt, nr
This member, if present, contains a sub-functor particularizing the meaning of the assigned functor. See Section 7.13.1. Subfunctors in Manual for Tectogrammatical Annotation.
is_member
This member is applicable only to child nodes of nodes with nodetype=coap, i.e. of nodes with a functor for coordination, apposition or operation (OPER); see also Section 6.6. Parataxis in Manual for Tectogrammatical Annotation and Section 8.11. Mathematical operations and intervals in Manual for Tectogrammatical Annotation. For other nodes this member is not present. The value is 1 for those child nodes of nodes with nodetype=coap that represent members of a coordination or apposition, and for operands. Child nodes (except those with functor CM or RHEM) of nodes with nodetype=coap that do not have is_member=1 are common (joint) modifications of all their sibling nodes with is_member=1. For nodes with functor CM (Section 7.12.4. The CM functor in Manual for Tectogrammatical Annotation) the member is_member is never present and they are considered a part of the coordinating conjunction. Nodes with functor RHEM (see Section 7.7. Functors of Rhematizers, Sentence, Linking and Modal Adverbial Expressions in Manual for Tectogrammatical Annotation) have special rules for positioning in a tree (see Section 10.6.2. Basic guidelines for the position of rhematizers in tectogrammatical trees in Manual for Tectogrammatical Annotation) and they do not fit the description above. If the member is not present, value 0 is implied.
For more details see Section 6.6. Parataxis in Manual for Tectogrammatical Annotation.
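The rules above divide the children of a node with nodetype=coap into three groups. The sketch below illustrates that division for nodes held as Python dictionaries; it assumes that functor is stored as a plain string (in the data it may be an alternative of values), and the function name classify_coap_children and the sample nodes are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]


def classify_coap_children(
    children: List[Node],
) -> Tuple[List[Node], List[Node], List[Node]]:
    """Split children of a coap node into members, shared modifiers, and CM/RHEM nodes."""
    members: List[Node] = []
    shared: List[Node] = []
    special: List[Node] = []
    for child in children:
        if child.get("functor") in ("CM", "RHEM"):
            special.append(child)   # handled by their own rules
        elif child.get("is_member", 0) == 1:
            members.append(child)   # member of the coordination/apposition, or operand
        else:
            shared.append(child)    # joint modification of all members
    return members, shared, special


children = [
    {"id": "t1", "functor": "ACT", "is_member": 1},
    {"id": "t2", "functor": "ACT", "is_member": 1},
    {"id": "t3", "functor": "RSTR"},
    {"id": "t4", "functor": "CM"},
]
members, shared, special = classify_coap_children(children)
print([n["id"] for n in members], [n["id"] for n in shared], [n["id"] for n in special])
```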
is_name_of_person
If the value is 1, the node is a part of a name of a person. If the member is not filled or not present, value 0 can be assumed.
quot
Value type: list of structures with members described below
This member marks the nodes corresponding to a part of the text within “quotation marks”. A unique identifier is associated with each part of a text in quotation marks. The association is expressed in the following way: each node representing part of a quoted text has, among the values of its member quot, an element whose set_id equals the identifier of the corresponding quoted part of the text. Thus one node can belong to none, one or more sets (embedded quotes) specified in this way. See Section 8.19.1. Text within quotation marks in Manual for Tectogrammatical Annotation.
quot/type
required member
Value type: enumerated: citation, dsp, meta, other, title
Determines the type of use of the quotation marks. Value dsp is used for direct speech, citation marks a formally connected quotation, meta marks meta use, title marks a proper name, and other marks any other way of using quotation marks. See Section 8.19.1. Text within quotation marks in Manual for Tectogrammatical Annotation.
quot/set_id
required member
Contains an identifier unambiguously delimiting the set of nodes representing the part of text within quotation marks.
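Because each node lists the quoted sets it belongs to, the sets themselves can be reconstructed by grouping nodes on quot/set_id. The following sketch does this for nodes held as Python dictionaries, with quot as a list of dictionaries carrying type and set_id; the function name group_by_quot_set and the sample identifiers are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_quot_set(nodes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each quot/set_id to the ids of the nodes that belong to that set."""
    sets: Dict[str, List[str]] = defaultdict(list)
    for node in nodes:
        for entry in node.get("quot", []):
            sets[entry["set_id"]].append(node["id"])
    return dict(sets)


# Node t2 lies inside an embedded quote, so it belongs to two sets.
nodes = [
    {"id": "t1", "quot": [{"type": "dsp", "set_id": "q1"}]},
    {"id": "t2", "quot": [{"type": "dsp", "set_id": "q1"},
                          {"type": "title", "set_id": "q2"}]},
    {"id": "t3"},
]
quot_sets = group_by_quot_set(nodes)
print(quot_sets)
```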
is_dsp_root
Value 1 identifies roots of subtrees capturing direct speech (even if the direct speech is not indicated by quotation marks). See Section 8.3. Direct speech in Manual for Tectogrammatical Annotation.
sentmod
Value type: enumerated: enunc, excl, desid, imper, inter
The sentence modality - see Section 5.7. The sentmod attribute in Manual for Tectogrammatical Annotation.
gram
This structure is used only for complex nodes (nodes with value complex of the member nodetype). See Section 5.3. Attributes superior to grammatemes in Manual for Tectogrammatical Annotation, Section 5.4. Values of the grammatemes in Manual for Tectogrammatical Annotation and Section 5.5. Grammatemes in Manual for Tectogrammatical Annotation.
Possible value is a structure with the following members:
gram/sempos
required member
Value type: enumerated: n.denot, n.denot.neg, n.pron.def.demon, n.pron.def.pers, n.pron.indef, n.quant.def, adj.denot, adj.pron.def.demon, adj.pron.indef, adj.quant.def, adj.quant.indef, adj.quant.grad, adv.denot.grad.nneg, adv.denot.ngrad.nneg, adv.denot.grad.neg, adv.denot.ngrad.neg, adv.pron.def, adv.pron.indef, v
The member sempos contains information about the semantic part of speech and subgroup of the node. See Section 5.3.1. The sempos attribute in Manual for Tectogrammatical Annotation.
gram/gender
Value type: enumerated: anim, inan, fem, neut, inher, nr
Grammateme of gender - see Section 5.5.2. The gender grammateme in Manual for Tectogrammatical Annotation.
gram/number
Value type: enumerated: sg, pl, inher, nr
Grammateme of number - see Section 5.5.1. The number grammateme in Manual for Tectogrammatical Annotation.
gram/degcmp
Value type: enumerated: pos, comp, acomp, sup, nr
Grammateme of degree of comparison - see Section 5.5.8. The degcmp grammateme (degree) in Manual for Tectogrammatical Annotation.
gram/verbmod
Value type: enumerated: ind, imp, cdn, nr, nil
Grammateme of verb modality - see Section 5.5.9. The verbmod grammateme (verbal modality) in Manual for Tectogrammatical Annotation.
gram/deontmod
Value type: enumerated: deb, hrt, vol, poss, perm, fac, decl, nr
Grammateme of deontic modality - see Section 5.5.10. The deontmod grammateme (deontic modality) in Manual for Tectogrammatical Annotation.
gram/tense
Value type: enumerated: sim, ant, post, nr, nil
Grammateme of tense - see Section 5.5.13. The tense grammateme in Manual for Tectogrammatical Annotation.
gram/aspect
Value type: enumerated: proc, cpl, nr
Grammateme of aspect - see Section 5.5.12. The aspect grammateme in Manual for Tectogrammatical Annotation.
gram/resultative
Value type: enumerated: res1, res0, nr
Grammateme of resultativeness - see Section 5.5.14. The resultative grammateme (resultative aspect) in Manual for Tectogrammatical Annotation.
gram/dispmod
Value type: enumerated: disp1, disp0, nr, nil
Grammateme of dispositional modality - see Section 5.5.11. The dispmod grammateme (dispositional modality) in Manual for Tectogrammatical Annotation.
gram/iterativeness
Value type: enumerated: it1, it0, nr
Grammateme of iterativeness - see Section 5.5.15. The iterativeness grammateme in Manual for Tectogrammatical Annotation.
gram/indeftype
Value type: enumerated: relat, indef1, indef2, indef3, indef4, indef5, indef6, inter, negat, total1, total2, nr
Grammateme of type of indefiniteness - see Section 5.5.6. The indeftype grammateme in Manual for Tectogrammatical Annotation.
gram/person
Value type: enumerated: 1, 2, 3, inher, nr
Grammateme of person - see Section 5.5.3. The person grammateme in Manual for Tectogrammatical Annotation.
gram/numertype
Value type: enumerated: basic, set, kind, ord, frac, nr
Grammateme of type of numeral - see Section 5.5.5. The numertype grammateme in Manual for Tectogrammatical Annotation.
gram/politeness
Value type: enumerated: polite, basic, inher, nr
Grammateme of politeness - see Section 5.5.4. The politeness grammateme in Manual for Tectogrammatical Annotation.
gram/negation
Value type: enumerated: neg0, neg1, nr
Grammateme of negation - see Section 5.5.7. The negation grammateme in Manual for Tectogrammatical Annotation.
tfa
Value type: enumerated: t, f, c
Expresses the annotation of contextual boundness. See Section 10.2. Contextual boundness in Manual for Tectogrammatical Annotation. Value t is assigned to nodes representing contextually (non-contrastively) bound expressions, value c to nodes representing contextually contrastively bound expressions, and value f to nodes representing contextually non-bound expressions. If the member is not present, this property is not applicable to the node (typically this happens for nodes with the value of the member nodetype equal to coap, fphr or root).
is_parenthesis
Value 1 marks the nodes representing expressions that are part of a parenthesis. See Section 6.7. Parenthesis in Manual for Tectogrammatical Annotation. If the member is not present, value 0 is assumed.
is_state
Value 1 marks nodes representing (usually verbal) modifications with the meaning of state. If the member is not present, value 0 is assumed. See Section 7.13.2. The attribute with the meaning of “state” in Manual for Tectogrammatical Annotation.
deepord
required member
Value type: non-negative integer
This member labels nodes with consecutive non-negative integers representing the so-called deep word order (see Section 10.3. Communicative dynamism in Manual for Tectogrammatical Annotation). The ordering determined by the member deepord is also used for visualizing trees in graphical applications (left to right with respect to the increasing value of the member deepord).
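Displaying the nodes of a tectogrammatical tree from left to right therefore amounts to sorting them by deepord. The sketch below shows this for a flat list of nodes held as Python dictionaries; the function name in_deep_order and the sample values are invented for illustration.

```python
from typing import Any, Dict, List


def in_deep_order(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the nodes sorted by increasing deepord (left to right)."""
    return sorted(nodes, key=lambda node: node["deepord"])


# Nodes listed in arbitrary order; the root always has deepord 0.
nodes = [
    {"id": "t2", "deepord": 2},
    {"id": "t-root", "deepord": 0},
    {"id": "t3", "deepord": 3},
    {"id": "t1", "deepord": 1},
]
ordered_ids = [node["id"] for node in in_deep_order(nodes)]
print(ordered_ids)
```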