In WP4, the parser was continuously improved throughout 2023, based on findings from the manual correction of parser output on the Czech part of the Prague Czech-English Dependency Treebank (PCEDT-cz, WP7), to ensure the best possible starting point for annotation of the Prague Dependency Treebank of Spoken Czech (PDTSC) in 2024. As improvements to the parser continued until the very end of 2023, the actual processing of the PDTSC data was postponed to January 2024.
In WP5, approximately 1 thousand sentences of the PDTSC were annotated manually to test the quality of both the parser and the final, partially manually corrected data.
In WP6, after consultations with the main authors of the PDT-C, Faust was approved for annotation with discourse relations (intra-sentential ones only, as Faust consists of sentences that do not form a continuous text). The main reason for its inclusion was a general principle of the annotations included in the PDT-C, namely that an annotation should be present in the whole data.
WP7 accounted for most of the work done on the project in 2023. The PCEDT-cz data were enriched with discourse annotation projected from the PDTB, as well as with automatically parsed discourse relations. Complex rules were devised to merge these two inputs automatically; unclear cases were subsequently checked and corrected manually. The resulting data contain 28 thousand explicit discourse relations. Measured against 1 thousand manually annotated instances, the quality of the resulting annotation compares favourably with the inter-annotator agreement reported for PDiT 1.0.
Two conference papers were prepared and submitted to LREC-Coling 2024.