Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0)

Introduction

PDiT-EDA 1.0 (Zikánová et al., 2018) is a treebank with rich annotation of discourse phenomena, developed in 2017–2018 within the project Implicitní vztahy v textové koherenci (Implicit relations in text coherence), project GA17-03461S of the Grant Agency of the Czech Republic.

The corpus extends the annotation of discourse relations in a subset of the Prague Discourse Treebank 2.0 (PDiT 2.0, Rysová et al., 2016), a large corpus of Czech journalistic texts manually annotated with explicit discourse relations. It newly adds implicit relations, entity-based relations, question–answer relations and other discourse-structuring phenomena.

PDiT-EDA 1.0 was published on December 20, 2018 in the LINDAT/CLARIN repository.

Data, License and Availability

PDiT-EDA 1.0 can be downloaded as a single zip archive from the LINDAT/CLARIN repository. It is publicly available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

After unzipping the downloaded archive, the data can be found in the directory data, where they are further divided into fifteen subdirectories representing individual genres (advice column, collection, comment, critical review, description, invitation, letters from readers, news report, overview, personality-focused interview, readers’ survey, reflective essay, sports news, topical interview, weather forecast). The annotation of each document is captured in four interlinked files, one per annotation layer: word layer (*.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz).
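The layout described above can be checked with a short script. The following is a minimal sketch, assuming the archive unpacks to data/<genre>/<document>.<layer>.gz; the function names group_documents and incomplete_documents are hypothetical helpers introduced here for illustration, not part of the distribution.

```python
from collections import defaultdict
from pathlib import Path

# Layer suffixes named in the text: word, morphological, analytical, tectogrammatical.
LAYERS = ("w", "m", "a", "t")


def group_documents(data_dir):
    """Map (genre, document name) to the set of layer files found on disk.

    Assumes file names of the form <document>.<layer>.gz inside one
    subdirectory per genre (an assumption based on the description above).
    """
    found = defaultdict(set)
    for path in Path(data_dir).glob("*/*.gz"):
        parts = path.name.split(".")
        if len(parts) >= 3 and parts[-2] in LAYERS:
            document = ".".join(parts[:-2])
            found[(path.parent.name, document)].add(parts[-2])
    return dict(found)


def incomplete_documents(documents):
    """Return the documents that lack at least one of the four layers."""
    result = {}
    for key, layers in documents.items():
        missing = sorted(set(LAYERS) - layers)
        if missing:
            result[key] = missing
    return result
```

Calling group_documents("data") from the directory where the archive was unzipped should yield one entry per document, and incomplete_documents should return an empty dictionary if every document has all four layer files.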

The data are stored in the Prague Markup Language format (PML, Pajas and Štěpánek 2008), an XML-based format for linguistic annotations (especially treebanks). For the sake of completeness, the PML schemata of the files can be found in the directory resources (the schemata are XML files that describe the structure of the annotated files).
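Because the files are gzip-compressed XML, they can also be inspected without TrEd using a general-purpose XML parser. The sketch below only counts element names in one file; it assumes nothing about the PML element inventory (which is defined by the schemata in the directory resources), and element_counts and local_name are hypothetical helper names.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace written as {uri}name, leaving only the name."""
    return tag.rsplit("}", 1)[-1]


def element_counts(path):
    """Count element names in one gzip-compressed XML file.

    The file is parsed incrementally so that large documents
    do not have to be held in memory as a full tree.
    """
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        for _, element in ET.iterparse(handle, events=("end",)):
            counts[local_name(element.tag)] += 1
            element.clear()  # release the content of elements already counted
    return counts
```

Running element_counts on a *.t.gz file and printing the most common entries gives a quick overview of which elements the tectogrammatical layer actually uses, which can then be compared against the corresponding schema.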

How to browse the data

The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open, browse and modify the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.

After the installation, a few extensions need to be installed:
  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
  4. Make sure that at least the extensions "Discourse Annotation (discourse)" and "Prague Dependency Treebank 3.0 (pdt30)" are checked for installation (if they are not in the list, they may have already been installed).
  5. Click on the button "Install Selected"; the selected extensions (and some dependencies) get installed.
  6. Close all TrEd windows including the main application window and start TrEd again.

Now, TrEd is able to open the data of PDiT-EDA 1.0. To see the annotation of a document on the tectogrammatical layer, open the respective file with extension .t.gz, and switch Mode: (top right corner) to PML_T_Discourse.

If you have trouble installing TrEd or browsing the data, please contact the authors at tred at ufal.mff.cuni.cz.

How to search the data

PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks, developed primarily for searching in the PDT data. TrEd can be used as a user-friendly graphical client and local search server, using the extension "PML Tree Query Interface for TrEd (pmltq)" (follow the same installation steps as for the discourse and pdt30 extensions described above). Please refer to the PML-TQ web page for information on the query language.

TrEd with the PML-TQ extension allows for local search in the data. In TrEd, go to Macros -> Start Tree Query (or press Shift+F3), choose Files (local) as a data source, click on Add List and choose the file data/impl_all.fl (a filelist of all files in the corpus). An entry List: # fl: all implicit will be added to the list of available filelists. Select it and click on OK. The system is now ready to process your queries. Please see the tutorial in the PML-TQ documentation.

How to cite

If you use the corpus data or wish to refer to the data for any other reason, please cite the data publication:

Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018

Annotation Principles

The annotation of discourse relations in PDiT-EDA 1.0 was inspired by the annotation scenario of the Penn Discourse Treebank 2.0, which follows a lexically grounded approach: a discourse connective is a lexical anchor of a discourse relation that holds between two text spans called discourse arguments. The connective signals the sense of the discourse relation (Table 1 gives a list of possible senses).

The annotation of discourse relations in the original PDiT 2.0 covers only explicit relations. To be able to study other types of cohesive means as well (and having only limited resources), we selected a subcorpus (approx. 5%) of PDiT 2.0 and enriched the original annotation of explicit discourse relations with the annotation of implicit relations, entity-based relations, question–answer relations and other discourse-structuring phenomena. Our aim was to mark all local connections between discourse arguments and to present a text as a continuous chain of discourse segments, with the following possible types of connections:

  • explicit discourse relations expressed by primary discourse connectives (expressions such as because, if, but etc.)
  • explicit discourse relations expressed by secondary discourse connectives (e.g. the reason is that)
  • implicit discourse relations (without a discourse connective)
  • entity-based relations (relations based on the coreferential connections between discourse arguments)
  • questions (question–answer relation and also a relation between the previous context and the question)
  • lists (e.g. first, ... second, ...)
  • coherence gaps (no relation to the preceding context can be found)
  • specific parts of a text (author, location, heading, caption, etc.)
  • attribution (relation between the author's speech and the reported speech)
  • macrostructure (relation between large segments of the text related to the text as a whole)

 

Table 1: List of possible discourse types (senses)
CONTRAST                EXPANSION                CONTINGENCY              TEMPORAL
confrontation           conjunction              reason–result            synchrony
opposition              conjunctive alternative  pragmatic reason–result  precedence–succession
restrictive opposition  disjunctive alternative  explication
pragmatic contrast      instantiation            condition
concession              specification            pragmatic condition
correction              equivalence              purpose
gradation               generalization
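For scripts that aggregate relations by sense class, the taxonomy in Table 1 can be written down as a small lookup structure. This is only a sketch: the strings are the human-readable labels from Table 1, and it is an assumption that they would need to be mapped to whatever attribute values the data files actually use. The names SENSES, CLASS_OF and sense_class are hypothetical.

```python
# The four sense classes and their senses, transcribed from Table 1.
SENSES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason–result", "pragmatic reason–result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence–succession"],
}

# Reverse index: sense label -> sense class.
CLASS_OF = {sense: cls for cls, senses in SENSES.items() for sense in senses}


def sense_class(sense):
    """Return the class of a sense label from Table 1.

    A plain hyphen is accepted in place of the en dash used in the table.
    """
    return CLASS_OF[sense.strip().lower().replace("-", "–")]
```

For example, sense_class("concession") returns "CONTRAST", and sense_class("reason-result") returns "CONTINGENCY".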

Corpus Size

PDiT-EDA 1.0 contains annotation of 100 documents (2 592 sentences, 41 877 words) of PDiT 2.0, coming from 15 different genres (see Table 2).

Table 2: Number of sentences and documents annotated in different genres
genre          #sentences  #documents
advice                200          10
collection            163           4
comment               171           6
description           170           4
essay                 170           4
invitation            170           7
letter                176           6
news                  189          15
overview              171           6
person_interv         104           2
review                167           7
sport                 196           9
survey                163           8
topic_interv          291           7
weather                91           5
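The per-genre figures in Table 2 can be cross-checked against the corpus totals stated above (100 documents, 2 592 sentences, 15 genres). The sketch below simply transcribes the table and sums its columns; the name TABLE_2 is introduced here for illustration.

```python
# genre -> (number of sentences, number of documents), transcribed from Table 2.
TABLE_2 = {
    "advice": (200, 10),
    "collection": (163, 4),
    "comment": (171, 6),
    "description": (170, 4),
    "essay": (170, 4),
    "invitation": (170, 7),
    "letter": (176, 6),
    "news": (189, 15),
    "overview": (171, 6),
    "person_interv": (104, 2),
    "review": (167, 7),
    "sport": (196, 9),
    "survey": (163, 8),
    "topic_interv": (291, 7),
    "weather": (91, 5),
}

total_sentences = sum(sentences for sentences, _ in TABLE_2.values())
total_documents = sum(documents for _, documents in TABLE_2.values())

print(len(TABLE_2), "genres,", total_documents, "documents,",
      total_sentences, "sentences")
```

The sums agree with the totals given in the text: 15 genres, 100 documents and 2 592 sentences.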

References

Petr Pajas and Jan Štěpánek: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Petr Pajas and Jan Štěpánek: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.

Magdaléna Rysová, Pavlína Synková, Jiří Mírovský, Eva Hajičová, Anna Nedoluzhko, Radek Ocelák, Jiří Pergler, Lucie Poláková, Veronika Pavlíková, Jana Zdeňková, Šárka Zikánová: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, http://hdl.handle.net/11234/1-1905, Dec 2016 

Šárka Zikánová, Pavlína Synková, Jiří Mírovský: Enriched Discourse Annotation of PDiT Subset 1.0 (PDiT-EDA 1.0). Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-2906, Dec 2018