Documentation

This page describes the Prague Discourse Treebank 4.0 (PDiT 4.0, Synková et al. 2024a) and summarizes the changes made to the annotation of discourse relations since their previous publication in the Prague Discourse Treebank 3.0 (PDiT 3.0, Synková et al. 2022). For details on previous versions of the underlying Prague Dependency Treebank (PDT) and on the separate releases of the Prague Discourse Treebank (PDiT), please refer to their respective documentation.

1 Introduction

Prague Discourse Treebank 4.0 (PDiT 4.0) annotates all four subcorpora of the underlying Prague Dependency Treebank - Consolidated 2.0:

  • Prague Dependency Treebank (PDT); in the previous versions, PDiT only covered these data
  • Prague Czech-English Dependency Treebank, the Czech part (PCEDT-cz); newly annotated in PDiT 4.0
  • Prague Dependency Treebank of Spoken Czech (PDTSC); newly annotated in PDiT 4.0
  • Faust; newly annotated in PDiT 4.0

2 Annotation process

While the PDT part of PDiT 4.0 had been annotated manually and iteratively checked and improved in the previous releases of PDiT, the three new parts of PDiT 4.0 were annotated partially automatically:

  • PDT - manually (the extensive changes published in the previous version, PDiT 3.0, are described in Synková et al. 2024b)
  • PCEDT-cz - annotation projection from PDTB 3.0 combined with automatic discourse parsing, discrepancies checked manually (described in detail in Mírovský et al. 2024)
  • PDTSC - automatic discourse parsing, the most problematic parts checked manually
  • Faust - automatic discourse parsing, all relations checked manually

2.1 PDT

PDiT 4.0 fully covers the PDiT 3.0 data.

2.2 PCEDT-cz

In PCEDT-cz, we utilize three different sources of information and combine them to get the discourse annotation: (i) annotation projection from the Penn Discourse Treebank 3.0, (ii) the manual tectogrammatical (deep syntax) representation of the sentences of the corpus, and (iii) CzeDLex, the lexicon of Czech discourse connectives. Sources (ii) and (iii) were used for discourse parsing of whole corpora, so at the beginning, the data contained PDTB (Penn) relations from annotation projection and PDT (Prague) relations from automatic discourse parsing.

By comparing these two types of annotation, three situations were detected: (i) the Penn relation and the Prague relation overlapped (i.e. the arguments were the same, the type and sense were compatible and the connectives at least partly overlapped), (ii) the Penn relation and the Prague relation partly overlapped, and (iii) only a Penn relation or only a Prague relation occurred in the given context. All these situations and their subtypes (e.g. a partial match had different properties depending on whether it involved the start or the target node of the argument) were studied on a sample of approx. 200 sentences (overall, approx. 2 thousand positions in the data were inspected manually during the various stages of preparation), and rules for merging Prague relations with Penn relations were defined for the individual types of situations.
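The three situations can be illustrated with a minimal sketch. The `Relation` class, the `compare` function and the `compatible` callback are hypothetical names introduced here for illustration only; they are not part of the PDiT tooling. The sketch also assumes that the two relations passed in were found in the same context, so that any pair that does not overlap fully is treated as a partial overlap.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Relation:
    """Hypothetical, simplified view of one discourse relation."""
    start: str             # id of the start node
    target: str            # id of the target node
    sense: str             # Prague discourse type or Penn sense label
    connective: frozenset  # ids of the tokens forming the connective


def compare(penn: Optional[Relation], prague: Optional[Relation],
            compatible: Callable[[str, str], bool]) -> str:
    """Classify a context as 'full', 'partial', 'penn_only' or 'prague_only'."""
    if penn is None and prague is None:
        raise ValueError("at least one relation is required")
    if prague is None:
        return "penn_only"
    if penn is None:
        return "prague_only"
    same_args = (penn.start, penn.target) == (prague.start, prague.target)
    shared_connective = bool(penn.connective & prague.connective)
    if same_args and shared_connective and compatible(penn.sense, prague.sense):
        return "full"
    # Assumption: both relations come from the same context, so anything
    # short of a full overlap counts as a partial one.
    return "partial"
```

In practice, the subtypes of partial overlap (for example a mismatch at the start node versus the target node) would need to be distinguished as well, because different merging rules applied to them.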

For intra-sentential relations, the arguments (start and target nodes) of Prague relations were usually reliable, as the parser took advantage of the manual annotation of the tectogrammatical tree and of the (also manually annotated) morphological information of the nodes, while the PDTB arrows suffered from wrong alignment. In contrast, for inter-sentential relations the parser could not rely as much on the structure of the tectogrammatical trees, and the PDTB arrows proved more reliable for detecting less standard positions of the arguments.

After as many discrepancies as possible had been resolved automatically, the final discourse annotation was reached by manual inspection of all the remaining problematic cases. Overall, PCEDT-cz contains approx. 29 thousand relations; 6 thousand positions were checked to verify the existence of a relation and its arguments, and 8 thousand positions were inspected to check the discourse type or sense. (Details of the process are described in Mírovský et al., 2024.)

2.3 PDTSC

At first, discourse relations in PDTSC were annotated fully automatically by the discourse parser, which had been iteratively improved during the annotation of PCEDT-cz. As part of the data (approx. 1000 sentences) had been annotated fully manually in previous stages of the work, the automatic annotation was then compared with the manual one and the main discrepancies were detected.

Since more discrepancies were found in the annotation of inter-sentential relations (i.e. relations between sentences, in contrast to relations occurring within one compound sentence), the revision of the annotation proceeded from inter-sentential to intra-sentential relations. Based on lists of all connective-relation combinations in the data and their frequencies, the revision also proceeded from less common combinations of connectives and types and from highly ambiguous connectives to more common combinations and less ambiguous connectives.

In addition, special attention was paid to the opposition relation, because the parser did not distinguish all subtypes of contrast relations (all of them were thus covered by opposition), and to all pragmatic relations, because there are no formal clues by which the parser could detect them (all pragmatic relations were thus covered by "normal" types: pragmatic condition by condition, pragmatic opposition by opposition and pragmatic reason-result by reason-result).

In total, PDTSC contains 31 thousand relations. More than 7 thousand inter-sentential relations were checked manually, as well as more than 15 thousand intra-sentential relations (the discourse type was revised in 6,400 cases, the extent of an argument was changed in 4,100 cases, and the connective was modified in 1,400 cases). More than 8 thousand intra-sentential relations remain in their automatic version (those with a reliable combination of discourse connective and relation type). In addition, the revision of automatic relations led to the annotation of almost 1,200 new relations (often in contexts with several connectives expressing more than one relation between the same arguments).

As a final step, expressions homonymous with the most frequent connectives that had not been included in any discourse relation as connectives were checked in specific constructions (e.g. in the second part of a coordinated structure or in relative clauses) to find out whether the data contained further discourse relations of a more complicated nature. Only a few new relations were detected in this way.
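The frequency-based ordering of the revision can be sketched as follows. The function `revision_order` is a hypothetical illustration, not the script actually used. How rarity and ambiguity were weighed against each other is not specified above, so the sort key (rarest combination first, ties broken by the number of distinct types attested for the connective) is an assumption.

```python
from collections import Counter, defaultdict


def revision_order(relations):
    """Order distinct (connective, discourse_type) pairs for revision.

    `relations` is an iterable of (connective, discourse_type) pairs,
    one pair per annotated relation.
    """
    pairs = list(relations)
    counts = Counter(pairs)

    # Ambiguity of a connective = number of distinct types it occurs with.
    types_per_connective = defaultdict(set)
    for connective, discourse_type in pairs:
        types_per_connective[connective].add(discourse_type)

    def key(pair):
        connective, _ = pair
        # Assumed priority: rare combinations first, then more ambiguous
        # connectives; the pair itself makes the order deterministic.
        return (counts[pair], -len(types_per_connective[connective]), pair)

    return sorted(counts, key=key)
```

For example, a rare combination such as ale with correction would be revised before the frequent combination of a with conjunction.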

2.4 Faust

Discourse relations in Faust were first annotated automatically and then all checked manually. Expressions homonymous with the most frequent connectives that had not been included in any discourse relation as connectives were checked in the whole data to find out whether the data contained further discourse relations of a more complicated nature. Only a few new relations were detected in this way.

3 Annotation extent

The annotation covers intra- and inter-sentential discourse relations marked by explicitly expressed primary or secondary connectives. Whereas primary connectives are grammaticalized, mostly one-word expressions (like a “and”, ale “but”, když “when”, protože “because”), secondary connectives are expressions that are not (yet) fully grammaticalized (cf. z tohoto důvodu “for this reason”, za těchto podmínek “under these conditions”, kvůli tomu “due to this” etc.). Discourse relations annotated in the corpus hold between two spans of text (containing finite verbs) called discourse arguments. On the tectogrammatical layer of the treebank, the relations are captured by an arrow leading between two verbal nodes (or their coordinations) representing the whole arguments (see Figure 1). Each relation is also provided with a discourse type (a semantico-pragmatic label such as reason-result, condition, purpose etc., see Table 1) and with the exact extent of the discourse arguments.

Figure 1: S ohledem na toto ustanovení by se hrubé chování muselo týkat vaší osoby a nestačí pouze nevhodné zacházení s předmětem darovací smlouvy, to je darem. Z tohoto důvodu by byla vaše žaloba na vrácení daru u soudu zamítnuta.

[With regard to this provision, the abusive behaviour would have to be related to your person and an inappropriate treatment of the subject of the donation contract is not enough. For this reason, your action on the return of the donation would be rejected at the court.]

 

Table 1: List of possible discourse types
CONTRAST                EXPANSION                CONTINGENCY              TEMPORAL
confrontation           conjunction              reason–result            synchrony
opposition              conjunctive alternative  pragmatic reason–result  precedence–succession
restrictive opposition  disjunctive alternative  explication
pragmatic contrast      instantiation            condition
concession              specification            pragmatic condition
correction              equivalence              purpose
gradation               generalization

4 Transformation of Prague discourse types to Penn senses

The transformation of Prague discourse types to Penn senses was carried out for all four subcorpora, with the same rules used for all parts. The transformation process consisted of two separate parts: (i) generation of the plain text form of the arguments and connectives from their representation on the tectogrammatical layer, and (ii) transformation of Prague discourse types to Penn senses. The details of the whole process are described in Mírovský et al. (2023); a shorter account is given in Mírovský et al. (2024) and in this section.

(i) The numerous issues in extracting plain text forms of the arguments can be split into two categories: (a) annotation inconsistencies in various parts of the data (on the deep-syntactic layer, on the surface-syntactic layer, in the discourse annotation), and (b) the complex nature of the deep-syntactic layer of annotation (reconstructed nodes or parts of the trees that take part in discourse relations, the necessity to combine information from several annotation layers). Although we took great care in tuning the plain text generation of the arguments, we could not check and fix errors in all discourse relations.

(ii) Most of the relations could be transformed automatically, as a single Penn counterpart corresponded to the Prague discourse type. However, in many cases there was more than one option. Special attention was paid to the PDTB 3.0 senses Similarity, negative Condition and negative Result: they were analyzed in the data, and formal clues distinguishing them from conjunction, condition and reason-result were used for their automatic annotation. In contrast, there were no reliable formal clues for transforming pragmatic relations (pragmatic opposition, pragmatic reason-result, pragmatic condition), explication and restrictive opposition to the respective Penn senses, so the corresponding senses were assigned to these Prague types manually, either mostly (in PCEDT-cz, where some labels were usable thanks to annotation projection) or fully (in the other subcorpora).

All common Penn sense counterparts of Prague discourse types are displayed in Table 2.

Table 2: Transformation of PDiT discourse types to PDTB 3.0 senses (second-level ones)
PDiT discourse type     PDTB 3.0 sense(s)

COMPARISON
concession              Comparison.Concession
confrontation           Comparison.Contrast
correction              Expansion.Substitution
gradation               Expansion.Conjunction
opposition              Comparison.Concession
pragm. contrast         Comparison.Concession+B,
                        Comparison.Concession+SA,
                        Comparison.Concession
restrictive opposition  Expansion.Exception,
                        Comparison.Contrast,
                        Comparison.Concession

CONTINGENCY
condition               Contingency.Condition,
                        Contingency.Neg-condition
explication             Contingency.Cause+B,
                        Expansion.Level-of-detail,
                        Contingency.Cause+SpeechAct
purpose                 Contingency.Purpose
pragm. reason-result    Contingency.Cause+B,
                        Contingency.Cause+SA,
                        Contingency.Cause
pragm. condition        Contingency.Condition+SA,
                        Contingency.Neg-condition+SA,
                        Contingency.Condition
reason-result           Contingency.Cause,
                        Contingency.Neg-cause

EXPANSION
conjunction             Expansion.Conjunction,
                        Comparison.Similarity
conj. alternative       Expansion.Disjunction
disj. alternative       Expansion.Disjunction
equivalence             Expansion.Equivalence
generalization          Expansion.Level-of-detail
instantiation           Expansion.Instantiation
specification           Expansion.Level-of-detail

TEMPORAL
precedence-succession   Temporal.Asynchronous
synchrony               Temporal.Synchronous
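As an illustration of how such a table can drive the automatic step, the following sketch looks up the candidate senses for a discourse type and returns a sense only when the mapping is unambiguous. It contains only an excerpt of Table 2, keyed by the labels used in the table (the values actually stored in the data may differ), and the names `PDIT_TO_PDTB3` and `transform` are hypothetical. The real transformation additionally used formal clues and manual decisions for the ambiguous types, which this sketch does not attempt to reproduce.

```python
# Excerpt of Table 2: PDiT discourse type -> candidate PDTB 3.0 senses.
PDIT_TO_PDTB3 = {
    "concession": ["Comparison.Concession"],
    "confrontation": ["Comparison.Contrast"],
    "correction": ["Expansion.Substitution"],
    "purpose": ["Contingency.Purpose"],
    "condition": ["Contingency.Condition", "Contingency.Neg-condition"],
    "reason-result": ["Contingency.Cause", "Contingency.Neg-cause"],
    "conjunction": ["Expansion.Conjunction", "Comparison.Similarity"],
    "synchrony": ["Temporal.Synchronous"],
}


def transform(discourse_type):
    """Return the PDTB 3.0 sense if it is unique, otherwise None.

    None signals that the type has several candidate senses and needs
    further clues or a manual decision. Unknown types raise KeyError.
    """
    senses = PDIT_TO_PDTB3[discourse_type]
    return senses[0] if len(senses) == 1 else None
```

Returning None rather than the first candidate keeps ambiguous cases visible instead of silently assigning a default sense.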

 

5 Changes from the previous release

In connection with the changes made to the tectogrammatical trees for PDT-C 2.0, some start or target nodes have moved in the tree structure or have disappeared completely. The annotation of discourse relations was most affected by the changes related to the reinterpretation of ellipses and to the rules for nominal versus verbal coordination. All cases where a change occurred were detected automatically and then checked manually by an experienced annotator. In total, almost 1000 contexts were checked (600 for the PDT part and 200 each for the PCEDT and PDTSC parts), and the annotated phenomena were adapted to the new structure of the data in the given places.

6 List of discourse-related annotation attributes in PDiT 4.0

Discourse-related annotation is captured mostly in the structured attribute discourse at the start node of the relation; additional annotation is captured in the attributes discourse_groups and discourse_feature.

  • discourse/target_node.rf – id of the target node, or undefined if there is no target node (e.g. no hypertheme in a list structure)

  • discourse/type – the type of an arrow, two possible values: discourse (discourse relation), list (list entry)

  • discourse/start_range – start range of a discourse arrow; possible values: n, where n (a non-negative integer) is the number of trees to the right of the current tree that belong to the argument in addition to the node and its subtree (0 means just the node and its subtree); group (an arbitrary set of nodes; see the attributes discourse/start_group_id and discourse_groups below); forward (the node with its subtree plus an unspecified number of the following trees); backward (the node with its subtree plus an unspecified number of the preceding trees)

  • discourse/target_range – target range of a discourse arrow; the possible values are the same as for discourse/start_range

  • discourse/start_group_id – identifier of a group of nodes (positive integer) where the start_range of the arrow is set to "group"; individual nodes belonging to the group keep the group identifier in the attribute discourse_groups

  • discourse/target_group_id – identifier of a group of nodes (positive integer) where the target_range of the arrow is set to "group"; individual nodes belonging to the group keep the group identifier in the attribute discourse_groups

  • discourse/discourse_type – type of discourse semantic relation, such as cond (textual condition)

  • discourse/is_secondary – set to 1 if the relation is expressed by a secondary connective

  • discourse/is_negated – set to 1 if the relation is expressed by a negated secondary connective

  • discourse/comment – further specifies the discourse type for some relations expressed by secondary connectives; three possible values: Regard, Conclusion, Entailment.

  • discourse/t-connectors.rf – list of ids of nodes from the tectogrammatical layer that represent the discourse connective (or the core of the secondary discourse connective)

  • discourse/a-connectors.rf – list of ids of nodes from the analytical layer that represent the discourse connective (or the core of the secondary discourse connective)

  • discourse/t-connectors_ext.rf – list of ids of nodes from the tectogrammatical layer that represent the whole ("extended") secondary discourse connective

  • discourse/a-connectors_ext.rf – list of ids of nodes from the analytical layer that represent the whole ("extended") secondary discourse connective

  • discourse_groups – list of identifiers of groups the given node belongs to

  • discourse_feature – three possible values for three special roles of the phrase represented by the node and its subtree: heading (replaces the attribute is_heading from PDiT 1.0), metatext and caption. Replaces discourse_special from previous releases.

  • sense_PDTB3 – a transformation of the discourse type to a Penn Discourse Treebank 3.0 sense.

  • sense_PDTB3_manual – a manually filled-in sense for cases when automatic transformation of the discourse type to a Penn Discourse Treebank 3.0 sense would fail; this value was then used in sense_PDTB3
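The possible values of discourse/start_range and discourse/target_range can be interpreted with a small helper like the one below. This is a minimal sketch: the function name `parse_range` and the returned tuples are hypothetical, and it assumes that the attribute value has already been read from the data as a string.

```python
def parse_range(value):
    """Interpret a start_range / target_range value.

    Returns a (kind, extra_trees) pair:
      ("trees", n)        the node with its subtree plus n following trees
      ("group", None)     an arbitrary set of nodes, see the *_group_id attributes
      ("forward", None)   the node with its subtree plus an unspecified
                          number of following trees
      ("backward", None)  the node with its subtree plus an unspecified
                          number of preceding trees
    """
    if value in ("group", "forward", "backward"):
        return (value, None)
    n = int(value)  # raises ValueError for anything that is not an integer
    if n < 0:
        raise ValueError(f"range must be non-negative, got {n}")
    return ("trees", n)
```

For the "group" case, the nodes of the argument would then be collected by matching discourse/start_group_id (or discourse/target_group_id) against the discourse_groups attribute of the individual nodes.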

References

Mírovský, J., Synková, P., Poláková, L., Paclíková, M.: Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA, Torino, Italy, ISBN 978-2-493814-10-4, ISSN 2522-2686, pp. 4067-4077, 2024

Mírovský, J., Synková, P., Poláková, L., Rysová, M.: Prague to Penn Discourse Transformation. The Prague Bulletin of Mathematical Linguistics, (120):5–30, 2023

Synková, P., Mírovský, J., Paclíková, M., Poláková, L., Rysová, M., Scheller, V., Zdeňková, J., Zikánová, Š., Hajičová, E.: Prague Discourse Treebank 4.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-5680, Dec 2024a

Synková, P., Mírovský, J., Poláková, L., Rysová, M.: Announcing the Prague Discourse Treebank 3.0. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA, Torino, Italy, ISBN 978-2-493814-10-4, ISSN 2522-2686, pp. 1270-1279, 2024b

Synková, P., Rysová, M., Mírovský, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š., Hajičová, E.: Prague Discourse Treebank 3.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-4875, Dec 2022