PDT-VALLEX is a valency lexicon of Czech verbs, nouns, adjectives, and
adverbs that occur at least once in the t-layer data of PDT 2.0. The whole
lexicon is stored in a single XML file (data/pdt-vallex/vallex.xml
on the PDT 2.0 CD-ROM). The structure of the file is formally described
by a Document Type Definition (data/pdt-vallex/vallex.dtd) and,
equivalently, by a RELAX NG schema (data/pdt-vallex/vallex.rng).
The linguistic interpretation of the lexicon is explained
in the Manual for tectogrammatical annotation.

The following paragraphs contain a simplified and informal introduction to
the physical structure of the lexicon.

The top-level element of the lexicon is named <valency_lexicon> and
consists of three parts: <head>, <body>, and <tail>. The first and the
last served only for technical purposes during the annotation, e.g. for
capturing the history of changes or for storing the list of annotators.
The core of the lexicon is formed by the <body> element.

The <body> element contains a sequence of <word> elements, each of them
corresponding to an individual word entry. Each word entry is associated
with attributes lemma (corresponding to t-lemma in tectogrammatical trees;
e.g. PDT-VALLEX lemma "bát se" corresponds to t-lemma "bát_se"),
POS (semantic part of speech), and id (word entry identifier). Besides the
attributes, each word entry contains a sequence of frame entries,
represented by <frame> elements and embedded in the <valency_frames>
element.
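
As an informal illustration of the layout described above, the following
Python sketch walks the word entries of a tiny, hand-written stand-in for
the lexicon. The element and attribute names come from the description
above; the identifiers, the POS value "V", and the helper names
list_words and to_t_lemma are invented for this example. The lemma
conversion simply generalizes the single "bát se" / "bát_se" example and
is an assumption, not a rule stated by the lexicon documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon. Element and attribute names follow the
# description above; ids and the POS value are invented for illustration.
SAMPLE = """\
<valency_lexicon>
  <head/>
  <body>
    <word id="w1" lemma="bát se" POS="V">
      <valency_frames>
        <frame id="w1f1"/>
        <frame id="w1f2"/>
      </valency_frames>
    </word>
  </body>
  <tail/>
</valency_lexicon>
"""


def list_words(root):
    """Return (id, lemma, POS, number of frames) for every word entry."""
    rows = []
    for word in root.findall("./body/word"):
        frames = word.findall("./valency_frames/frame")
        rows.append(
            (word.get("id"), word.get("lemma"), word.get("POS"), len(frames))
        )
    return rows


def to_t_lemma(lemma):
    """Assumed mapping from a lexicon lemma to a t-lemma: space -> underscore."""
    return lemma.replace(" ", "_")


root = ET.fromstring(SAMPLE)
words = list_words(root)
for word_id, lemma, pos, n_frames in words:
    print(word_id, lemma, to_t_lemma(lemma), pos, n_frames)
```

For the real file, ET.parse would be given the path to vallex.xml instead
of the inline sample.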

The <frame> element corresponds to one of the valency frames of the
lexical unit in question (specified by the lemma attribute of the
enclosing word entry). Each valency frame has its identifier stored in
the id attribute (this is the identifier referred to by the nodes on the
t-layer), together with several technical attributes. Each valency frame
must be accompanied by an example sentence or sentence fragment
(<example> element) illustrating the usage of the frame in Czech.

The valency frame itself is formally represented as a sequence
of valency slots (<element> elements) listed in <frame_elements>.

Each frame slot has its functor (attribute functor, specifying the deep
syntactic relation of the slot with respect to its governing lexical unit,
such as ACT, ADDR or LOC), type (attribute type, distinguishing between
obligatory and non-obligatory frame slots), and one or more possible
surface realizations represented by a sequence of <form> elements (and,
in parallel, by the attribute form, which uses the so-called compact
notation; see the Manual).
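
To make the frame and slot structure more concrete, here is a minimal
Python sketch that reads one hand-written frame and summarizes its slots.
The element and attribute names are those described above; the frame
identifier, the example text, the type values "oblig" and "non-oblig",
and the helper name describe_frame are assumptions made for illustration
only. The compact-notation form attribute is left out because its syntax
is defined in the Manual.

```python
import xml.etree.ElementTree as ET

# Hypothetical frame entry. The id, the example text, and the type values
# are invented; check vallex.dtd for the values actually allowed.
FRAME = """\
<frame id="w1f1">
  <example>example sentence goes here</example>
  <frame_elements>
    <element functor="ACT" type="oblig"><form/></element>
    <element functor="ADDR" type="oblig"><form/><form/></element>
    <element functor="LOC" type="non-oblig"><form/></element>
  </frame_elements>
</frame>
"""


def describe_frame(frame):
    """Collect the frame id, its example, and one record per valency slot."""
    slots = []
    for slot in frame.findall("./frame_elements/element"):
        slots.append(
            {
                "functor": slot.get("functor"),
                "type": slot.get("type"),
                # Number of alternative surface realizations of the slot.
                "n_forms": len(slot.findall("form")),
            }
        )
    return {
        "id": frame.get("id"),
        "example": frame.findtext("example"),
        "slots": slots,
    }


info = describe_frame(ET.fromstring(FRAME))
for slot in info["slots"]:
    print(info["id"], slot["functor"], slot["type"], slot["n_forms"])
```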

There are two types of restrictions on the slot surface form captured
in the <form> element: either the restrictions correspond to one of four
special cases (typical, elided, recip, or state), or the surface form is
expressed using a simplified analytical tree prototype. The tree consists
of one or more <node> elements (embedded in each other, thus reflecting
the tree topology). The constraints on the nodes, such as pos, case, or
lemma, are stored as attributes of the respective <node> elements.

For instance, a specific prepositional group can be
represented as a tree composed of two nodes: the constraint
on the upper node is the lemma of the preposition, whereas the
constraint on the child node is the number of the morphological case.
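
The two-node prepositional group described above could look roughly like
the following sketch, which also shows one way to flatten nested node
elements into a list of constraints. The concrete values (preposition
"na", case 4) and the helper name node_constraints are chosen for
illustration and are not taken from the lexicon.

```python
import xml.etree.ElementTree as ET

# Hypothetical surface form: a preposition node governing a node that is
# constrained only by its case. The values are invented for illustration.
FORM = """\
<form>
  <node lemma="na">
    <node case="4"/>
  </node>
</form>
"""


def node_constraints(node, depth=0):
    """Flatten nested <node> elements into (depth, constraints), parent first."""
    rows = [(depth, dict(node.attrib))]
    for child in node.findall("node"):
        rows.extend(node_constraints(child, depth + 1))
    return rows


form = ET.fromstring(FORM)
rows = [row for top in form.findall("node") for row in node_constraints(top)]
for depth, constraints in rows:
    # Indentation mirrors the embedding of the nodes in the tree.
    print("  " * depth + str(constraints))
```

Because the nodes are embedded in each other, a short recursive walk is
enough to recover the tree topology together with the per-node constraints.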