The Prague Dependency Treebank (PDT) contains a large amount of Czech text with complex, interlinked morphological, syntactic and semantic annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. Newer versions also contain multiword expression annotation, discourse relation annotation, and various other additions and corrections made since the first full release of PDT 2.0.
PDT is based on the long-standing Praguian linguistic tradition, adapted to the needs of current computational linguistics research. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation and language analysis are included. Extensive documentation (in English) is provided as well.
The difference from the separately published original treebanks can be briefly described as follows:
Hajič Jan, Bejček Eduard, Hlaváčová Jaroslava, Mikulová Marie, Straka Milan, Štěpánek Jan, Štěpánková Barbora: Prague Dependency Treebank - Consolidated 1.0. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4, pp. 5208-5218, 2020.
The Prague Dependency Treebank 3.5 contains the same texts as the previous versions since 2.0; there are 49,431 annotated sentences (over 800 thousand nodes) on all layers, from tectogrammatical to words, and additional sentences on the analytical (surface dependency syntax) and morphological layers of annotation (approx. 1.8 million words in total).
The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts. There are other members of the "family" of the Prague Dependency Treebanks, available separately and described elsewhere; search for "Prague Dependency Treebank" in the LINDAT/CLARIN repository.
Quick download link and PID: http://hdl.handle.net/11234/1-2621. The data is provided under the CC BY-NC-SA 4.0 license. For proper citation(s) and more details, see below.
The first version of PDT was published by LDC in 2001. Since then, various branches of PDT have been developed, adding more annotation. Most importantly, PDT 2.0 added the tectogrammatical layer, which distinguishes the PDT family of treebanks from most other dependency treebanks available. As of January 2018, PDT 3.5 is the current version, encompassing all previous versions, corrections and additional annotation. The history of the PDT editions is briefly listed below.
To download the data, please visit the PDT 3.5 item in the LINDAT/CLARIN repository.
To search the treebank, please use the PML-TQ (PML Tree Query) service at LINDAT/CLARIN. Please note that this service currently searches PDT 3.0; except for the discourse annotation added later in PDiT 2.0, the data are identical. (PDT 3.5 in PML-TQ is coming soon.)
To properly acknowledge this resource, please cite the following data item in the LINDAT/CLARIN repository:
For LREC papers (separate language resources references):
@languageresource{lrPDT35,
title = {Prague Dependency Treebank 3.5},
author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina
and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i}
and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava
and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie
and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i}
and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila
and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr
and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na
and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka
and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka
and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
url = {http://hdl.handle.net/11234/1-2621},
publisher={Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University},
address={Prague, Czech Republic},
lindat={http://hdl.handle.net/11234/1-2621},
year = {2018} }
For general papers and citations:
@misc{11234/1-2621,
title = {Prague Dependency Treebank 3.5},
author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina
and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i}
and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava
and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie
and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i}
and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila
and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr
and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na
and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka
and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka
and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
url = {http://hdl.handle.net/11234/1-2621},
note = {{LINDAT}/{CLARIN} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
year = {2018} }
For "plaintext" reference:
(Hajič et al., 2018)
Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Hajičová, E., Havelka, J., Homola, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mikulová, M., Mírovský, J., Nedoluzhko, A., Pajas, P., Panevová, J., Poláková, L., Rysová, M., Sgall, P., Spoustová, J., Straňák, P., Synková, P., Ševčíková, M., Štěpánek, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š. and Žabokrtský, Z. (2018). Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University, LINDAT/CLARIN PID: http://hdl.handle.net/11234/1-2621.
For footnote references, the following is sufficient in LaTeX papers:
\url{http://hdl.handle.net/11234/1-2621}
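As a minimal sketch of how the general-purpose entry and the footnote reference might be combined in a paper: the example below assumes the @misc entry above has been saved in a file named references.bib (a hypothetical file name) and uses the plain bibliography style, which is only one possible choice. The url package is loaded because it provides the \url command.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{url} % provides \url for the footnote reference

\begin{document}

% Cite the data item using the key of the @misc entry.
Our experiments use the Prague Dependency Treebank 3.5~\cite{11234/1-2621}.%
\footnote{\url{http://hdl.handle.net/11234/1-2621}}

% Assumption: the @misc entry is stored in references.bib next to this file.
\bibliographystyle{plain}
\bibliography{references}

\end{document}
```

For LREC submissions, the @languageresource entry with the key lrPDT35 would be cited instead, following the conference's own template.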
Slides and video recordings from Prague Treebanking for Everyone, a two-day tutorial (Vilem Mathesius Lecture Series 21), are also available.