Prague Discourse Treebank 4.0 (PDiT 4.0)

The Prague Discourse Treebank 4.0 (PDiT 4.0; Synková et al. 2024) annotates discourse relations marked by primary and secondary discourse connectives across all data of the Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0). Compared with the previous versions (see the complete list below), covering the whole PDT-C 1.0 significantly increases the size of the annotated data.

PDiT 4.0 annotates the whole PDT-C 1.0, i.e. all four of its subcorpora:

  • Prague Dependency Treebank (PDT); the only data covered by previous versions of PDiT
  • Prague Czech-English Dependency Treebank (PCEDT); newly annotated in PDiT 4.0
  • Prague Dependency Treebank of Spoken Czech (PDTSC); newly annotated in PDiT 4.0
  • Faust; newly annotated in PDiT 4.0

Since version 3.0, PDiT has used two taxonomies of discourse relation types:

  • the original Prague taxonomy of discourse types, and
  • the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al. 2019) taxonomy of discourse senses.

Also since version 3.0, the PDiT data have been offered in two formats:

  • the native format of the PDT-C 1.0, i.e. the Prague Markup Language, where the discourse relations are annotated on top of deep-syntax dependency trees (tectogrammatics), and
  • the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al. 2019) format of stand-off discourse annotation on plain texts.
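To illustrate what stand-off annotation on plain text means in practice, the following minimal Python sketch resolves character-offset spans against a raw text. The span notation (`start..end`, with `;` between discontinuous parts), the half-open offsets, the example sentence, and the function names are assumptions made for illustration only; they are not a description of the PDiT 4.0 distribution, whose documentation should be consulted for the actual field layout.

```python
from typing import List, Tuple


def parse_span_list(span_list: str) -> List[Tuple[int, int]]:
    """Parse a span list such as "0..17;26..40" into (start, end) pairs.

    Assumption: spans are written as start..end and separated by ';'.
    """
    spans = []
    for part in span_list.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty fields (e.g. a missing argument)
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def resolve_spans(raw_text: str, span_list: str) -> str:
    """Return the text covered by a span list, joining discontinuous parts.

    Assumption: offsets are character positions with an exclusive end.
    """
    return " ".join(raw_text[s:e] for s, e in parse_span_list(span_list))


# Hypothetical example: the annotation lives outside the text and only
# points into it by character offsets.
raw_text = "He stayed at home because it was raining."
connective = resolve_spans(raw_text, "18..25")
arg1 = resolve_spans(raw_text, "0..17")
arg2 = resolve_spans(raw_text, "26..40")
```

Because the annotation only stores offsets, the raw text stays untouched, and several annotation layers can refer to the same file independently.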

Earlier versions of the discourse annotation of the PDT data were published either separately, as versions of PDiT, or as part of PDT releases:

  • the Prague Discourse Treebank 3.0 (PDiT 3.0; Synková et al. 2022),
    • largely revised annotation of discourse relations, newly also in the PDTB 3.0 format and sense taxonomy
  • the PDT part of the Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0; Hajič et al. 2020),
    • fixes of individual errors in the discourse annotation
  • the Prague Dependency Treebank 3.5 (PDT 3.5; Hajič et al. 2018),
    • fixes of individual errors in the discourse annotation
  • the Prague Discourse Treebank 2.0 (PDiT 2.0; Rysová et al. 2016),
    • discourse relations marked by selected secondary connectives, updated discourse relations marked by primary connectives
  • the Prague Dependency Treebank 3.0 (PDT 3.0; Bejček et al. 2013), and
    • genres of documents, updated discourse relations marked by explicit (mostly primary) connectives
    • also, as a part of the underlying data, pronominal textual coreference of 1st and 2nd person
  • the Prague Discourse Treebank 1.0 (PDiT 1.0; Poláková et al. 2012)
    • discourse relations marked by explicit (mostly primary) connectives
    • also, as a part of the underlying data, extended textual coreference and bridging anaphora

The Prague Discourse Treebank 4.0 can be downloaded from the LINDAT/CLARIAH-CZ repository (see the Licence).

Upcoming in December 2024.

References

Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2013. (http://ufal.mff.cuni.cz/pdt3.0/)

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Fučíková, E., Hajičová, E., Havelka, J., Hlaváčová, J., Homola, P., Ircing, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mareček, D., Mikulová, M., Mírovský, J., Nedoluzhko, A., Novák, M., Pajas, P., Panevová, J., Peterek, N., Poláková, L., Popel, M., Popelka, J., Romportl, J., Rysová, M., Semecký, J., Sgall, P., Spoustová, J., Straka, M., Straňák, P., Synková, P., Ševčíková, M., Šindlerová, J., Štěpánek, J., Štěpánková, B., Toman, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š., Žabokrtský, Z.: Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0). Data/software, LINDAT-CLARIAH, URL: http://hdl.handle.net/11234/1-3185, 2020.

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Hajičová, E., Havelka, J., Homola, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mikulová, M., Mírovský, J., Nedoluzhko, A., Pajas, P., Panevová, J., Poláková, L., Rysová, M., Sgall, P., Spoustová, J., Straňák, P., Synková, P., Ševčíková, M., Štěpánek, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š., Žabokrtský, Z.: Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University, 2018. (http://hdl.handle.net/11234/1-2621)

Hajič et al.: Prague Dependency Treebank 2.0. Data/software, Linguistic Data Consortium, Philadelphia, PA, USA, 2006. ISBN 1-58563-370-4 (http://www.ldc.upenn.edu)

Poláková, L., Jínová, P., Zikánová, Š., Hajičová, E., Mírovský, J., Nedoluzhko, A., Rysová, M., Pavlíková, V., Zdeňková, J., Pergler, J., Ocelák, R.: Prague Discourse Treebank 1.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2012. (http://ufal.mff.cuni.cz/pdit/)

Prasad, R., Webber, B., Lee, A., Joshi, A.: Penn Discourse Treebank Version 3.0. Data/software, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, LDC2019T05, 2019

Rysová, M., Synková, P., Mírovský, J., Hajičová, E., Nedoluzhko, A., Ocelák, R., Pergler, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š.: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016

Synková, P., Mírovský, J., Paclíková, M., Poláková, L., Rysová, M., Scheller, V., Zdeňková, J., Zikánová, Š., Hajičová, E.: Prague Discourse Treebank 4.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-5680, Dec 2024

Synková, P., Rysová, M., Mírovský, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š., Hajičová, E.: Prague Discourse Treebank 3.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-4875, Dec 2022