The project aims at a theoretical and corpus-based representation of global coherence in Czech written texts. Global coherence assumes a hierarchical representation of smaller (clauses, sentences) and larger text units (e.g. paragraphs) and the existence of coherence relations between these units on all levels of the hierarchy. A single interconnected representation for the entire document is postulated, too. As a first step, up-to-date linguistic frameworks for global coherence analysis are critically evaluated. We benefit from our own long-term experience with describing various linguistic aspects of local coherence. Next, we will design a suitable scenario for representing global coherence with corpus methods and conduct a pilot annotation. The project combines and expands both the line of research on discourse and coherence developed at ÚFAL and recent advances in the international discourse-oriented community.
In June 2023, the Czech Rhetorical Structure Theory Treebank 1.0 was released in the Lindat/Clarin repository. The open-access treebank represents the first application of a global coherence model (an adjusted Rhetorical Structure Theory) to Czech linguistic data. The development and adaptation of the framework for the purposes of the project and for the Czech language brought new insights into the higher structure of texts; these findings are documented in the project results. Our work also enables further exploration of the description method and the use of the analyzed data by a wide community, including comparative studies with other languages analyzed for global coherence. A summarizing publication is listed below in the Publications section.
The main focus of the project in its third year was on determining a suitable annotation scheme for higher text structure in Czech and on the annotation itself. We have critically evaluated the Rhetorical Structure Theory framework, assessed its usability for the intended coherence annotation task and finally adjusted it for our annotation purposes. The adjustments are both technical (dealing with interrupted clauses and embedded content, types of unit attachment points, the treatment of juxtaposed structures) and linguistic (introduction of new semantico-pragmatic labels, reconsideration of some original labels primarily used for English, analysis of attribution, i.e. the relation of the speaker/writer to the uttered content). The resulting annotation design proposal is described in a new annotation manual for annotating RST in Czech, currently in development.
On this basis, we have annotated 50 Czech written documents of different genres (news, comments, essays etc.), each in one interconnected representation (a projective tree structure). To annotate Czech texts in the RST system, we used a local installation of the rstWeb web tool, modified for the needs of the project by extending the list of tags offered to annotators. A portion of the texts was annotated by two annotators in order to address consistency issues and general issues of text interpretation and understanding. The outputs of the rstWeb tool were used to measure inter-annotator agreement with the RST-Tace tool. The annotated data are currently being checked and cleaned and will subsequently be published in the Lindat/Clarin repository as an open-access resource. Due to the pandemic situation, the project was extended by six months.
In the second year of the project, we have further focused on assessing the early findings and applying them to the intended global coherence analysis by corpus methods. A large summarizing study was published in the international journal Dialogue and Discourse (Poláková et al., 2021, see below). In this study, apart from deepening our early work on hierarchies in local annotation and extending and comparing it to English locally-annotated data, we offer a) a description of distributional differences in semantic types of relations in cross-paragraph vs. intra-paragraph settings in the Prague Dependency Treebank, b) a study of paragraph-initial discourse connectives with the identification of Czech connectives typical only of higher structures, c) the detection of a prevalence of large left-sided arguments in locally annotated data, d) some new reflections on the methodologies of the approaches under scrutiny.
Another line of research was dedicated to the role of subjectivity and intentionality in discourse structure, more precisely the role of the so-called pragmatic (epistemic, speech-act) relations in discourse structuring (Poláková and Synková, 2021). The starting points of the analysis were the extent and manner of author involvement in relation to the text content and text structuring, and an analysis of inferences. The detailed study of pragmatic relations (as opposed to semantic relations) in their widest contexts reveals considerable diversity within this group. It shows some room for improvement in local annotation schemes and has direct consequences for understanding (not only) the nature of rhetorical labels in the global RST framework. RST has a similar division into semantic and presentational (pragmatic) relations, with textual relations as a third category, which needs to be reviewed for our purposes.
In preparation for the planned annotation of global coherence in the last year of the project, we have collected and prepared appropriate Czech texts from the PDiT-EDA treebank, a subset of the Prague Dependency Treebank annotated for local implicit relations. Further, we have selected and installed the annotation tool and tested its functionality.
In the first stage of the project, we have concentrated on researching mutual configurations and hierarchical structures of local discourse relations (in a local analytic approach), as compared to the principles of a global approach such as the Rhetorical Structure Theory (RST). Using qualitative and quantitative corpus methods and an advanced querying system, we have described the ways in which, and the extent to which, Czech data annotated for local coherence display features of higher text structure/global coherence. A first step of this research was published this year (Poláková and Mírovský, TSD 2020, see below).
On the basis of these findings, in terms of underlying theories and analytical methods in coherence processing, we have addressed the adequacy of some of the principles of local and global approaches to the description of discourse coherence on real texts, such as the tree-like representation of documents (RST) or the minimality principle (Penn Discourse Treebank, PDTB). The findings for Czech data are quite similar to those for English data published earlier: very few configurations of pairs of local discourse relations actually violate the tree-ness constraint applied in RST. The most decisive factor here is the definition of a discourse unit (argument) in each theoretical framework, together with the annotators' biases in the local, incrementally proceeding analysis vs. the global perspective. A specific role is also played by the treatment of cues/signals of these relations, in our case specifically the treatment of secondary connectives (connective phrases).
We have further explored the role of long-distance (mostly anaphoric) relations and connectives, which, in a different (global) analytic perspective, can be regarded as relations between large discourse units, i.e. relations of higher structure (Poláková et al., LREC 2020). We have also studied specific connective roles of the most common focalizers, which play a role in the thematic progression of a text and also function as operators in discourse relations (Hajičová, Mírovský, Štěpánková, PBML 2020).
Poláková Lucie, Mírovský Jiří, Zikánová Šárka, Hajičová Eva: Developing a Rhetorical Structure Theory Treebank for Czech. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Copyright © European Language Resources Association, Torino, Italy, ISBN 978-2-493814-10-4, ISSN 2522-2686, pp. 4802-4810, 2024
Poláková Lucie, Zikánová Šárka, Mírovský Jiří, Hajičová Eva: Czech RST Discourse Treebank 1.0. Data/software, LINDAT, Prague, Czech Republic, Jun 2023
Poláková Lucie, Mírovský Jiří: Connectives with both Arguments External: A Survey on Czech. In: Lecture Notes in Computer Science, Vol. 13451, 20th International Conference on Intelligent Text Processing and Computational Linguistics, Copyright © Springer Verlag, Berlin / Heidelberg, ISBN 978-3-031-24336-3, ISSN 0302-9743, pp. 61-72, 2023
Eva Hajičová, Jan Hajič, Barbora Hladká, Jiří Mírovský, Lucie Poláková, Kateřina Rysová, Magdaléna Rysová, Pavel Straňák, Barbora Štěpánková, Šárka Zikánová (2022): Corpus Annotation as a Feasible and Scientifically Beneficial Task. In: CLARIN: The Infrastructure for Language Resources, Copyright © Walter de Gruyter GmbH, Berlin/Boston, Mannheim, Germany, ISBN 978-3-11-076734-6, pp. 613-646.
Lucie Poláková (2022): Globální koherence českých textů a možnosti jejího korpusového zpracování. Zpráva o aktuálním projektu Ústavu formální a aplikované lingvistiky MFF UK. Jazykovědné aktuality, Vol. LIX, No. 1-2, Copyright © Ústav pro jazyk český AV ČR, Praha, Česká republika, ISSN 1212-5326, pp. 45-50.
Šárka Zikánová, Jiří Mírovský, Lucie Poláková (2022): Structuration globale du texte: une étude de corpus. In: Écho des études romanes, Vol. 18, Copyright © Université de Bohême du Sud, České Budějovice, ISSN 1801-0865, pp. 99-115. Pdf on request.
Discourse Relations and Connectives in Higher Text Structure. In: Dialogue and Discourse, ISSN 2152-9620, vol. 12, no. 2, pp. 1-37,
Pragmatické aspekty v popisu textové koherence. In: Naše řeč, ISSN 0027-8203, vol. 104, no. 4, pp. 225-242,
Mining Local Discourse Annotation for Features of Global Discourse Structure. In: 23rd International Conference on Text, Speech and Dialogue, pp. 50-60, Springer, Cham, Switzerland, ISBN 978-3-030-58322-4,
Focalizers and Discourse Relations. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 115, pp. 187-197,
GeCzLex: Lexicon of Czech and German Anaphoric Connectives. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 1082-1089, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4,