PDiT-EDA 1.0 (Zikánová et al., 2018) is a treebank with rich annotation of discourse phenomena, developed in 2017–2018 within the project Implicitní vztahy v textové koherenci (Implicit relations in text coherence), project GA17-03461S of the Grant Agency of the Czech Republic.
The corpus contains extended annotation of discourse relations for a subset of the Prague Discourse Treebank 2.0 (PDiT 2.0, Rysová et al., 2016), a large corpus of Czech journalistic texts manually annotated with explicit discourse relations. It newly adds implicit relations, entity-based relations, question–answer relations and other discourse-structuring phenomena.
PDiT-EDA 1.0 was published on December 20, 2018 in the LINDAT/CLARIN repository.
PDiT-EDA 1.0 can be downloaded as a single zip archive from the LINDAT/CLARIN repository. It is publicly available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
After unzipping the downloaded archive, the data can be found in the directory data, where they are further divided into fifteen subdirectories representing individual genres (advice column, collection, comment, critical review, description, invitation, letters from readers, news report, overview, personality-focused interview, readers' survey, reflective essay, sports news, topical interview, weather forecast). The annotation of each document is captured in four interlinked files, one per layer of annotation: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz).
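The layout described above (genre subdirectories, four layer files per document) can be checked with a short script. The following is a minimal sketch, not part of the distribution: it assumes that file names follow the pattern <document id>.<layer>.gz and that the genre is the name of the parent directory. The helper names group_layers, incomplete_documents and scan are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Layer codes taken from the file extensions listed above:
# word, morphological, analytical, tectogrammatical.
LAYERS = ("w", "m", "a", "t")


def group_layers(paths):
    """Group file paths by document, keyed on (genre directory, document id)."""
    docs = defaultdict(dict)
    for path in map(Path, paths):
        parts = path.name.split(".")
        # Assumed naming pattern: <document id>.<layer>.gz
        if len(parts) >= 3 and parts[-1] == "gz" and parts[-2] in LAYERS:
            doc_id = ".".join(parts[:-2])
            docs[(path.parent.name, doc_id)][parts[-2]] = path
    return dict(docs)


def incomplete_documents(docs):
    """Return, for each document lacking a layer file, the missing layer codes."""
    return {
        key: sorted(set(LAYERS) - set(found))
        for key, found in docs.items()
        if set(found) != set(LAYERS)
    }


def scan(data_dir):
    """Collect all layer files found one level below data_dir."""
    return group_layers(Path(data_dir).glob("*/*.gz"))
```

Calling scan("data") from the directory of the unzipped archive and passing the result to incomplete_documents should return an empty dictionary if every document has all four files.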
The data are stored in the Prague Markup Language format (PML, Pajas and Štěpánek 2008), which is an XML-based format for linguistic annotations (esp. treebanks). For the sake of completeness, PML schemata of the files can be found in the directory resources (the schemata are XML files that describe the structure of the annotated files).
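Because PML is XML-based, the annotation files can also be inspected without TrEd. The sketch below assumes that the .gz files are ordinary gzip-compressed XML; it makes no assumption about PML element names and simply counts how often each element occurs. The function names are illustrative only.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def element_counts(path):
    """Count how often each element name occurs in a gzip-compressed XML file."""
    counts = Counter()
    with gzip.open(path, "rb") as stream:
        # iterparse reads the file incrementally instead of building the full tree.
        for _event, element in ET.iterparse(stream, events=("end",)):
            counts[local_name(element.tag)] += 1
            element.clear()  # release the content of elements already counted
    return counts
```

Running element_counts on one of the *.t.gz files and printing counts.most_common(10) gives a quick overview of the structure that the schemata in the resources directory describe.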
The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open, browse and modify the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.
Now, TrEd is able to open the data of PDiT-EDA 1.0. To see the annotation of a document on the tectogrammatical layer, open the respective file with the extension .t.gz and switch Mode: (top right corner) to PML_T_Discourse.
In case of trouble with the installation of TrEd or with browsing the data, please contact the authors at (tred at ufal.mff.cuni.cz).
PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks, developed primarily for searching in the PDT data. With the extension "PML Tree Query Interface for TrEd (pmltq)", TrEd can serve as a user-friendly graphical client and a local search server (follow the same installation steps as for the discourse and pdt30 extensions described above). Please refer to the PML-TQ web page for information on the query language.
TrEd with the PML-TQ extension allows for local search in the data. In TrEd, go to Macros -> Start Tree Query (or press Shift+F3), choose Files (local) as the data source, click on Add List and choose the file data/impl_all.fl (a filelist of all files in the corpus). An entry List: # fl: all implicit will be added to the list of available filelists. Select it and click on OK. The system is now ready to process your queries. Please see the tutorial in the PML-TQ documentation.
If you use the corpus data or wish to refer to them for any other reason, please cite the publication of the data:
Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018
The annotation of discourse relations in PDiT-EDA 1.0 was inspired by the annotation scenario of the Penn Discourse Treebank 2.0, which follows a lexically grounded approach: a discourse connective is a lexical anchor of a discourse relation that holds between two text spans called discourse arguments. The connective signals the sense of the discourse relation (Table 1 gives a list of possible senses).
The annotation of discourse relations in the original PDiT 2.0 covers only explicit relations. To be able to study other types of cohesive means as well (with only limited resources), we selected a subcorpus (approx. 5%) of PDiT 2.0 and enriched the original annotation of explicit discourse relations with the annotation of implicit relations, entity-based relations, question–answer relations and other discourse-structuring phenomena. Our aim was to mark all local connections between discourse arguments and to present a text as a continuous chain of discourse segments, with the following possible types of connections:
CONTRAST | EXPANSION | CONTINGENCY | TEMPORAL |
---|---|---|---|
confrontation | conjunction | reason–result | synchrony |
opposition | conjunctive alternative | pragmatic reason–result | precedence–succession |
restrictive opposition | disjunctive alternative | explication | |
pragmatic contrast | instantiation | condition | |
concession | specification | pragmatic condition | |
correction | equivalence | purpose | |
gradation | generalization | | |
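For scripts that aggregate relations by sense class, the table of senses can be written down as a simple lookup structure. The sketch below is only a transcription of the table; the labels are the human-readable names used here and are not necessarily the attribute values stored in the PML files. The names SENSES, SENSE_TO_CLASS and sense_class are hypothetical.

```python
# The four sense classes and their senses, transcribed from the table above.
SENSES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason–result", "pragmatic reason–result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence–succession"],
}

# Reverse lookup: sense name -> sense class.
SENSE_TO_CLASS = {
    sense: sense_cls for sense_cls, senses in SENSES.items() for sense in senses
}


def sense_class(sense):
    """Return the class of a sense name, or None if the name is not in the table."""
    return SENSE_TO_CLASS.get(sense)
```

For example, sense_class("concession") returns "CONTRAST", and an unknown label yields None rather than raising an error.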
PDiT-EDA 1.0 contains annotation of 100 documents (2 592 sentences, 41 877 words) of PDiT 2.0, coming from 15 different genres (see Table 2).
genre | #sentences | #documents |
---|---|---|
advice | 200 | 10 |
collection | 163 | 4 |
comment | 171 | 6 |
description | 170 | 4 |
essay | 170 | 4 |
invitation | 170 | 7 |
letter | 176 | 6 |
news | 189 | 15 |
overview | 171 | 6 |
person_interv | 104 | 2 |
review | 167 | 7 |
sport | 196 | 9 |
survey | 163 | 8 |
topic_interv | 291 | 7 |
weather | 91 | 5 |
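The per-genre figures can be checked against the overall totals quoted above (100 documents, 2 592 sentences). The following sketch only transcribes the table and sums its columns; the variable names are illustrative.

```python
# (sentences, documents) per genre, transcribed from the table above.
GENRES = {
    "advice": (200, 10),
    "collection": (163, 4),
    "comment": (171, 6),
    "description": (170, 4),
    "essay": (170, 4),
    "invitation": (170, 7),
    "letter": (176, 6),
    "news": (189, 15),
    "overview": (171, 6),
    "person_interv": (104, 2),
    "review": (167, 7),
    "sport": (196, 9),
    "survey": (163, 8),
    "topic_interv": (291, 7),
    "weather": (91, 5),
}

total_sentences = sum(sentences for sentences, _ in GENRES.values())
total_documents = sum(documents for _, documents in GENRES.values())

print(len(GENRES), total_sentences, total_documents)  # prints: 15 2592 100
```

The sums agree with the totals given in the text, so the table and the summary sentence are consistent with each other.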
Petr Pajas and Jan Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.
Petr Pajas and Jan Štěpánek: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.
Magdaléna Rysová, Pavlína Synková, Jiří Mírovský, Eva Hajičová, Anna Nedoluzhko, Radek Ocelák, Jiří Pergler, Lucie Poláková, Veronika Pavlíková, Jana Zdeňková, Šárka Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, http://hdl.handle.net/11234/1-1905, Dec 2016.
Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018.