The Prague Czech-English Dependency Treebank 3.0 (PCEDT 3.0) can be downloaded as a single zip archive from the LINDAT-Clarin repository (see the Licence).
After unzipping the downloaded archive, the data can be found in the directory data, where they are divided into three subdirectories:
cz - Czech side only, three formats in three subdirectories: pml, treex, mrp
en - English side only, three formats in three subdirectories: pml, treex, mrp
parallel - Czech-English parallel data in the treex format
The fourth subdirectory there (dictionaries) contains three lexicons related to the corpus:
pdtvallex-4.0.xml - Czech valency lexicon PDT-Vallex 4.0
engvallex.xml - English valency lexicon EngVallex 2.0
czech-morfflex-2.0.xz - Czech morphological dictionary
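As a quick sanity check after unzipping, the layout described above can be verified with a short script. This is only a sketch: the directory names are taken from the description above rather than from inspecting the archive, and the function name missing_dirs is a hypothetical helper.

```python
from pathlib import Path

# Expected layout under the data directory, as described in the text above.
# These names are assumptions copied from the description, not read from the archive.
EXPECTED = {
    "cz": ["pml", "treex", "mrp"],
    "en": ["pml", "treex", "mrp"],
    "parallel": [],
    "dictionaries": [],
}


def missing_dirs(data_root):
    """Return the expected directories (relative paths) that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for top, subdirs in EXPECTED.items():
        # Check the top-level directory first, then each of its format subdirectories.
        for rel in [Path(top)] + [Path(top) / sub for sub in subdirs]:
            if not (root / rel).is_dir():
                missing.append(rel.as_posix())
    return missing
```

Calling missing_dirs("data") from the directory where the archive was unpacked should return an empty list if the layout matches the description.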
The three formats used are:
pml - a PML format (see below) used in the Prague Dependency Treebank since version 2.0; each document is represented by four files corresponding to four layers: t-layer (tectogrammatics), a-layer (analytics, surface syntax), m-layer (morphology) and w-layer (word layer, tokenized text)
treex - technically also a PML format, used in the NLP system Treex (all annotation layers are in a single file; allows for parallel data)
mrp - a JSON-based format used in the CoNLL 2019 and 2020 shared tasks on meaning representation parsing (see Uniform Graph Interchange Format); unlike the PML and Treex formats, the conversion to the MRP format is lossy: it extracts part of the annotation from the t- and w-layers while discarding morphology and surface syntax (Zeman and Hajič 2020)
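Because the mrp data are JSON-based, they can be inspected without any special tooling. The sketch below assumes the usual layout of the MRP shared-task interchange format, that is, one JSON object per line with top-level "id", "nodes" and "edges" keys; this has not been checked against the PCEDT files themselves, and summarize_mrp is a hypothetical helper name.

```python
import json


def summarize_mrp(lines):
    """Yield (graph id, node count, edge count) for each non-empty JSON line.

    Assumes one graph per line with optional "nodes" and "edges" lists,
    as in the MRP shared-task interchange format.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between graphs
        graph = json.loads(line)
        yield graph.get("id"), len(graph.get("nodes", [])), len(graph.get("edges", []))
```

An open file object can be passed directly, for example `with open(path, encoding="utf-8") as f: print(list(summarize_mrp(f)))`, where path points to one of the files in an mrp subdirectory.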
The Prague Markup Language format (PML, Pajas and Štěpánek 2008) is an XML-based format for linguistic annotation (esp. treebanks). With the exception of the MRP, the data formats used in the PCEDT are instances of the PML. For the sake of completeness, PML schemata of the files can be found in the directory resources. (The schemata are XML files that describe the structure of the annotated files.)
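Since PML files are XML, a generic XML parser is enough for a first look at their structure. The following sketch tallies element names in a file using only the Python standard library; it makes no use of the PML schemata, and the helper names open_maybe_gzip and count_elements are hypothetical. The gzip handling is an assumption for the case where files are distributed compressed.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def open_maybe_gzip(path):
    """Open a file in binary mode, using gzip if the name ends with .gz."""
    return gzip.open(path, "rb") if str(path).endswith(".gz") else open(path, "rb")


def count_elements(source):
    """Count XML elements by local name (namespace prefix stripped).

    The file is parsed incrementally and finished elements are cleared,
    so large documents do not have to be held in memory as a whole.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]  # "{namespace}name" -> "name"
        counts[local_name] += 1
        elem.clear()
    return counts
```

For example, `with open_maybe_gzip(path) as f: print(count_elements(f).most_common(10))` lists the ten most frequent element names in the file at path.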
The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open and browse the PML data, i.e. the data in the directories pml, treex and parallel. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions for your operating system.
For data in pml, make sure that at least the extension "Prague Dependency Treebank - Consolidated (pdt_c)" is selected for installation (if it is not in the list, it may already be installed).
For data in treex and parallel, make sure that at least the extension "EasyTreex - browse and edit Treex files (*.treex, *.treex.gz, *.streex) (easytreex)" is selected for installation (if it is not in the list, it may already be installed).
TrEd is then able to open the data in the directories pml, treex and parallel of the PCEDT 3.0.
In case of trouble with the installation of TrEd or with browsing the data, please contact the authors (tred at ufal.mff.cuni.cz).
PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks, developed for searching in any data encoded in the PML. The PDT-C as a whole and also all its subcorpora separately are available for searching using the PML-TQ from ÚFAL's public PML-TQ server. There are two clients available:
For further information about the clients, see the PML-TQ clients documentation. For general info about the PML-TQ, please refer to the PML-TQ web page. For documentation about the query language and tutorials, see the PML-TQ user documentation.
There are several PDT-C data sets available for searching, listed here by their treebank ids for the TrEd client:
pdtc10 - the whole PDT-C 1.0 data (without the English part of the PCEDT and the audio-related layers of the PDTSC)
pdtc10_faust - the Faust part of the PDT-C 1.0
pdtc10_pcedt-cz - the Czech PCEDT part of the PDT-C 1.0
pdtc10_pdt - the PDT part of the PDT-C 1.0
pdtc10_pdtsc - the PDTSC part of the PDT-C 1.0 (without the audio-related layers)
In case of trouble with the PML-TQ, or with searching in the PDT-C data in particular, please contact the developers (pmltq at ufal.mff.cuni.cz).
Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.
Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.
Zeman, D. and Hajič, J.: FGD at MRP 2020: Prague Tectogrammatical Graphs. In: Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-952148-64-4, pp. 33-39, 2020.