A manually annotated and genre-diversified language resource with rich linguistic information from morphology and syntax to semantics, The Prague Dependency Treebank – Consolidated 2.0 (PDT-C in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme.
PDT-corpora included in PDT-C:
The difference from the separately published original treebanks can be briefly described as follows:
In the previous PDT-C 1.0 version, the data was enhanced with manual linguistic annotation at the morphological layer. For the PDT-C 2.0 version, manual annotation at the analytical layer was performed in those parts of the corpus that had previously been annotated only by automatic tools. A further goal of the annotation work was to consolidate the manual annotation across all layers, which resulted in many modifications and corrections to the original annotation. Manual annotation of discourse relations is now also provided for all PDT-C 2.0 data (see also PDiT 4.0). As a result, the PDT-C 2.0 release contains manual annotation at all annotation layers (morphological, surface syntactic (analytical), and deep syntactic (tectogrammatical)) in all four datasets. Additional semantic features in the PDT dataset are also manually annotated.
Layers of annotations. The PDT-annotation scheme has a multi-layer architecture:
In addition to the three main annotation layers of the PDT scenario mentioned above, there is also the raw text layer (w-layer), where the text is segmented into documents and paragraphs and individual tokens are assigned unique identifiers. The spoken data contains an additional audio and speech recognition layer (z-layer). In the spoken data (as opposed to the written corpora), the w-layer is in fact also an “annotated” layer, namely the manually provided transcription of the audio signal.
In order not to lose any piece of the original information, tokens (nodes) at a lower layer are explicitly referenced from the corresponding closest (immediately higher) layer. These links allow for tracing every unit of annotation all the way down to the original text, or to the transcript and audio (in the spoken data).
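The downward linking described above can be pictured with a minimal sketch. It assumes a simplified model in which every unit holds at most one reference to a unit on the next lower layer; the `Unit` class, the `trace_down` helper, the identifiers and the labels are all hypothetical and are not the actual PDT-C data format or identifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Unit:
    layer: str                   # "w", "m", "a" or "t"
    uid: str                     # hypothetical identifier, not the real PDT-C id format
    label: str                   # illustrative content (word form, lemma, function)
    lower: Optional[str] = None  # uid of the unit referenced one layer down


def trace_down(uid: str, index: Dict[str, Unit]) -> List[Unit]:
    """Follow the downward references from a unit to the raw text layer."""
    chain: List[Unit] = []
    current: Optional[str] = uid
    while current is not None:
        unit = index[current]
        chain.append(unit)
        current = unit.lower
    return chain


# Illustrative units for a single word; the labels are assumptions,
# not values copied from the treebank.
units = [
    Unit("w", "w-1", "Sarančata"),
    Unit("m", "m-1", "saranče", lower="w-1"),
    Unit("a", "a-1", "Sb", lower="m-1"),
    Unit("t", "t-1", "ACT", lower="a-1"),
]
index = {u.uid: u for u in units}

for unit in trace_down("t-1", index):
    print(unit.layer, unit.uid, unit.label)
```

In this simplified picture, starting from a unit on the deep syntactic layer and following the `lower` references ends at the original token on the w-layer. The real data is richer than this one-to-one chain, so the sketch should be read only as an illustration of the principle.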
Sarančata jsou doposud ve stadiu larev a pohybují se pouze lezením. V tomto období je účinné bojovat proti nim chemickými postřiky, ale dožívající družstva ani soukromí rolníci nemají na jejich nákup potřebné prostředky.
Example sentences from PDT-C, with tectogrammatical annotation including coreference links (blue and brown arrows), MWEs (red stripes) and discourse annotation (orange arrows and attributes/labels). Lit.: Grasshoppers are still in the larval stage and move only by crawling. At this time of the year, it is efficient to fight them using chemical sprays, but neither the ailing cooperatives nor private farmers have the means needed to buy them.
In the previous PDT-C 1.0 release, manual annotation had been fully performed at the lowest, morphological layer; in addition, the basic phenomena of the highest, deep syntactic layer (structure, functions, verbal valency) had been annotated manually in all four datasets. Manual annotation of the surface syntactic layer was contained only in the dataset of PDT written texts. Additional semantic features in the PDT dataset had also been annotated manually. Table 1 presents an overview of the types of annotation at the three annotation layers in each dataset and of the manner in which each annotation was carried out.
| Dataset / Type of annotation | PDT Written | PCEDT (Czech) Translated | PDTSC Spoken | PDT-Faust User-generated |
|---|---|---|---|---|
| Audio | non-applicable | non-applicable | provided | non-applicable |
| ASR Transcription | non-applicable | non-applicable | provided | non-applicable |
| Transcript | non-applicable | non-applicable | manually | non-applicable |
| Translation | non-applicable | manually | non-applicable | manually |
| Morphological layer | | | | |
| Speech reconstruction | non-applicable | non-applicable | manually | non-applicable |
| Lemmatization | manually | manually | manually | manually |
| Tagging | manually | manually | manually | manually |
| Surface syntactic layer | | | | |
| Dependency structure | manually | manually | manually | manually |
| Syntactic function | manually | manually | manually | manually |
| Clause segmentation | automatically | not annotated | not annotated | not annotated |
| Deep syntactic layer | | | | |
| Deep syntactic structure | manually | manually | manually | manually |
| Deep syntactic function | manually | manually | manually | manually |
| Verbal valency | manually | manually | manually | manually |
| Nominal valency | manually | not annotated | not annotated | not annotated |
| Grammatemes | manually | not annotated | not annotated | not annotated |
| Coreference grammatical | manually | manually | manually | manually |
| Coreference textual | manually | manually | manually | manually |
| Bridging relation | manually | not annotated | not annotated | not annotated |
| Topic-focus articulation | manually | not annotated | not annotated | not annotated |
| Discourse | manually | manually | manually | manually |
| Genre specification | manually | not annotated | not annotated | not annotated |
| Quotation | manually | not annotated | not annotated | not annotated |
| Multiword expressions | manually | not annotated | not annotated | not annotated |
Table 1: Overview of various types of annotation and their realization in the datasets
The data volume is given in Table 2. Altogether, the consolidated treebank contains 3,885,591 tokens with manual morphological annotation, 3,432,078 a-nodes with manual surface syntactic annotation, and 2,244,288 t-nodes with manual deep syntactic annotation.
| Layer | PDT Written | PCEDT (Czech) Translated | PDTSC Spoken | PDT-Faust User-generated | Total |
|---|---|---|---|---|---|
| Morphological layer (number of m-forms) | 1,957,150 | 1,152,289 | 742,316 | 33,836 | 3,885,591 |
| Surface syntactic layer (number of a-nodes) | 1,503,637 | 1,152,289 | 742,316 | 33,836 | 3,432,078 |
| Deep syntactic layer (number of t-nodes) | 675,099 | 931,211 | 607,906 | 30,072 | 2,244,288 |
Table 2. Volume of the datasets (number of tokens on the respective layers)
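As a quick sanity check on the volumes in Table 2, the per-dataset counts can be summed and compared with the reported totals. The following is only a small sketch: the dictionary layout and the names `counts`, `reported_totals` and `layer_totals` are assumptions made for illustration, while the numbers are taken directly from Table 2.

```python
from typing import Dict

# Per-dataset unit counts copied from Table 2.
counts: Dict[str, Dict[str, int]] = {
    "m-forms": {"PDT": 1_957_150, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_836},
    "a-nodes": {"PDT": 1_503_637, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_836},
    "t-nodes": {"PDT": 675_099, "PCEDT": 931_211, "PDTSC": 607_906, "PDT-Faust": 30_072},
}

# Totals as reported in the last column of Table 2.
reported_totals: Dict[str, int] = {
    "m-forms": 3_885_591,
    "a-nodes": 3_432_078,
    "t-nodes": 2_244_288,
}


def layer_totals(table: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the per-dataset counts for every layer."""
    return {layer: sum(per_dataset.values()) for layer, per_dataset in table.items()}


computed = layer_totals(counts)
for layer, total in computed.items():
    status = "matches" if total == reported_totals[layer] else "differs from"
    print(f"{layer}: computed {total:,} {status} reported {reported_totals[layer]:,}")
```

The same structure can be reused to derive other figures, for example the share of each dataset within a layer.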
If you use the data in your research or need to cite it for any reason, please cite:
For LREC papers (separate language resources references):
@languageresource{lrPDT-C20,
title={{P}rague {D}ependency {T}reebank - {C}onsolidated 2.0 ({PDT-C} 2.0)},
author={Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina
and Bur\'{a}\v{n}ov\'{a}, Eva and Fu\v{c}\'{i}kov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i}
and Hlav\'{a}\v{c}ov\'{a}, Jaroslava and Homola, Petr and Ircing, Pavel and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava
and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie and Lopatkov\'{a}, Mark\'{e}ta
and Mare\v{c}ek, David and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i} and Nedoluzhko, Anna and Nov\'{a}k, Michal
and Pajas, Petr and Panevov\'{a}, Jarmila and Peterek, Nino and Pol\'{a}kov\'{a}, Lucie and Popel, Martin and Popelka, Jan
and Romportl, Jan and Rysov\'{a}, Magdal\'{e}na and Semeck\'{y}, Ji\v{r}\'{i} and Sgall, Petr and Spoustov\'{a}, Johanka
and Straka, Milan and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na
and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}indlerov\'{a}, Jana and {\v{S}}t\v{e}p\'{a}nek, Jan
and {\v{S}}t\v{e}p\'{a}nkov\'{a}, Barbora and Toman, Josef and Ure\v{s}ov\'{a}, Zde\v{n}ka and Vidov\'{a} Hladk\'{a}, Barbora
and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
url = {http://hdl.handle.net/11234/1-5813},
publisher={Institute of Formal and Applied Linguistics, LINDAT/CLARIAH-CZ, Charles University},
address={Prague, Czech Republic},
lindat={http://hdl.handle.net/11234/1-5813},
year={2024} }
For general papers and citations:
@misc{pdtc20,
title={{P}rague {D}ependency {T}reebank - {C}onsolidated 2.0 ({PDT-C} 2.0)},
author={Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina
and Bur\'{a}\v{n}ov\'{a}, Eva and Fu\v{c}\'{i}kov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i}
and Hlav\'{a}\v{c}ov\'{a}, Jaroslava and Homola, Petr and Ircing, Pavel and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava
and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie and Lopatkov\'{a}, Mark\'{e}ta
and Mare\v{c}ek, David and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i} and Nedoluzhko, Anna and Nov\'{a}k, Michal
and Pajas, Petr and Panevov\'{a}, Jarmila and Peterek, Nino and Pol\'{a}kov\'{a}, Lucie and Popel, Martin and Popelka, Jan
and Romportl, Jan and Rysov\'{a}, Magdal\'{e}na and Semeck\'{y}, Ji\v{r}\'{i} and Sgall, Petr and Spoustov\'{a}, Johanka
and Straka, Milan and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na
and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}indlerov\'{a}, Jana and {\v{S}}t\v{e}p\'{a}nek, Jan
and {\v{S}}t\v{e}p\'{a}nkov\'{a}, Barbora and Toman, Josef and Ure\v{s}ov\'{a}, Zde\v{n}ka and Vidov\'{a} Hladk\'{a}, Barbora
and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
url = {http://hdl.handle.net/11234/1-5813},
note = {{LINDAT}/{CLARIAH-CZ} digital library, Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
copyright={Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
year={2024} }
For "plaintext" reference:
(Hajič et al., 2024)
Jan Hajič, Eduard Bejček, Alevtina Bémová, Eva Buráňová, Eva Fučíková, Eva Hajičová, Jiří Havelka, Jaroslava Hlaváčová, Petr Homola, Pavel Ircing, Jiří Kárník, Václava Kettnerová, Natalia Klyueva, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, David Mareček, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Michal Novák, Petr Pajas, Jarmila Panevová, Nino Peterek, Lucie Poláková, Martin Popel, Jan Popelka, Jan Romportl, Magdaléna Rysová, Jiří Semecký, Petr Sgall, Johanka Spoustová, Milan Straka, Pavel Straňák, Pavlína Synková, Magda Ševčíková, Jana Šindlerová, Jan Štěpánek, Barbora Štěpánková, Josef Toman, Zdeňka Urešová, Barbora Vidová Hladká, Daniel Zeman, Šárka Zikánová, Zdeněk Žabokrtský: Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0). Data/software, LINDAT-CLARIAH-CZ, URL: http://hdl.handle.net/11234/1-5813, 2024.
For footnote references, the following is sufficient in LaTeX papers:
\url{http://hdl.handle.net/11234/1-5813}