Uniform Meaning Representation for Czech

Tags:

Annotations, Coreference, Corpora, Data, Lexicons, Monolingual, Morphology, Semantics, Syntax, Valency

Uniform Meaning Representation (UMR) for Czech

For centuries, linguists have deliberated on how to represent meaning. In recent years, this question has been seen not only as an intriguing theoretical challenge but also as one with practical implications for a range of applications: meaning representation can serve as a basis for any system that requires sound and reliable knowledge representation to enable logical inference.

While numerous formalisms for meaning representation have been proposed in recent decades, this project focuses on two specific approaches: the meaning representation used in the Prague Dependency Treebank (PDT) family and the Uniform Meaning Representation. The choice of the first formalism is motivated by the availability of data for Czech, particularly the PDT-C treebank. This treebank provides the most comprehensive Czech data (almost 175.5 thousand sentences across different genres) with fine-grained annotation at the tectogrammatical level, which captures linguistically structured meaning. The second approach, Uniform Meaning Representation (UMR), offers significant potential to enhance the PDT-C representation in several key ways:

UMR provides a more abstract representation, which is less dependent on a specific language and its structure.
UMR anchors concepts within a knowledge base, utilizing resources like English Wikipedia or WikiData.
UMR aims to support logical inference, an aspect that lies beyond the scope of the PDT.
Furthermore, UMR is being used for a variety of typologically diverse languages, including Chinese, Arapaho, Navajo, Kukama, and Sanapaná. This approach and its rich data may facilitate understanding some features of the Czech language from the typological point of view.

Project objective

The primary objective of the project is to explore the feasibility of a (semi-)automatic conversion of the PDT-C data into a format that adheres to the UMR specification. In particular, the project aims to identify:

Language phenomena that can be transferred relatively easily and reliably from the available Czech annotation to the UMR structures (e.g., sentence syntactic structure or coreference relations);
Phenomena that require specific treatment and detailed analysis but can still be transferred (e.g., modality or negation);
Phenomena that are unavailable in PDT-C and thus necessitate new annotations, either through automatic methods (utilizing advanced machine learning techniques) or through manual annotation (e.g., concept anchoring).
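
To make the first category more concrete, the following is a minimal sketch of how a tectogrammatical-style tree might be rendered as a PENMAN-style graph of the kind UMR uses at the sentence level. It is not the project's converter. The `TNode` class, the `to_penman` function, the variable naming scheme, and above all the fixed functor-to-role table are assumptions made for illustration; in practice the mapping of valency slots to argument labels depends on the individual verb, and named entities and other phenomena need dedicated treatment.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, heavily simplified table: a real conversion cannot map
# functors to argument labels with one fixed dictionary.
FUNCTOR_TO_ROLE = {"ACT": ":ARG0", "PAT": ":ARG1"}


@dataclass
class TNode:
    """Toy stand-in for a tectogrammatical node (lemma + functor)."""
    lemma: str
    functor: str = "PRED"
    children: List["TNode"] = field(default_factory=list)


def to_penman(root: TNode, sent_id: int = 1) -> str:
    """Render a toy tree as an indented PENMAN-style string."""
    counter = 0

    def render(node: TNode, depth: int) -> str:
        nonlocal counter
        counter += 1
        # Assumed variable scheme: sentence number plus a running index.
        var = f"s{sent_id}x{counter}"
        out = f"({var} / {node.lemma}"
        for child in node.children:
            # Functors without a known mapping are flagged, not guessed.
            role = FUNCTOR_TO_ROLE.get(
                child.functor, f":UNMAPPED-{child.functor}"
            )
            out += "\n" + "    " * (depth + 1) + f"{role} {render(child, depth + 1)}"
        return out + ")"

    return render(root, 0)


# "Petr čte knihu" (Petr reads a book), reduced to lemmas and functors.
tree = TNode("číst", children=[TNode("Petr", "ACT"), TNode("kniha", "PAT")])
print(to_penman(tree))
```

Flagging unmapped functors instead of guessing a role mirrors the distinction drawn above: some phenomena transfer directly, while others need detailed analysis or new annotation before they can be expressed in UMR.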

Data releases and release notes

UMR 1.0

This data release does not yet include Czech data; it is listed here for completeness. It is available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-5198. It contains all the data annotated by the U.S. team.

UMR 2.0

This data release contains the first version of the Czech conversion as well as manually prepared Latin data, both produced by the ÚFAL MFF UK team. It is also available from the LINDAT/CLARIAH-CZ repository at TBA.

Publications and presentations

Lopatková, M., Fučíková, E., Gamba, F., Štěpánek, J., Zeman, D., Zikánová, Š.: Towards a Conversion of the Prague Dependency Treebank Data to the Uniform Meaning Representation. In Proceedings of the 24th Conference Information Technologies – Applications and Theory (ITAT 2024), CEUR-WS.org, Košice, Slovakia, p. 62-76, 2024.
Hajič, J., Fučíková, E., Lopatková, M., Urešová, Z.: Mapping Czech Verbal Valency to PropBank Argument Labels. In Proceedings of the Fifth International Workshop on Designing Meaning Representations (DMR 2024), LREC-COLING 2024, ELRA Language Resource Association, p. 88-100, 2024.
Hajič, J.: Towards a Conversion of the Prague Dependency Treebank Data to the Uniform Meaning Representation. Presentation at the UMR meeting 2024, slides

Related projects

The development of Czech UMR has been supported by the following projects:

Project UMR – Uniform Meaning Representation, No. LUAUS23283, in the Inter-Excellence II program (Inter-Action subprogram), 2023-2027
The project primarily supports cooperation with the U.S. partner, preparation for release, manual checks, and work on the SynSemClass event-type ontology for application in UMR.

Project LUSyD: Language Understanding: from Syntax to Discourse, GAČR EXPRO program, Project No. GX20-16819X
This project covers fundamental research on meaning representations in general, including testing various Natural Language Understanding tools, work on discourse, and the foundations of the SynSemClass event-type ontology. From the UMR perspective, it supports a basic understanding of the UMR principles within the broader approach to meaning representations.

Project of the large research infrastructure LINDAT/CLARIAH-CZ, project No. LM2023062, MŠMT LRI program
This project provides infrastructural support for hosting the data, tools, and services developed in the UMR project, along with related resources. It also serves as the primary distribution repository for the data developed by the U.S. partner.

The UMR for Czech is also related to the following project:

Adapting Uniform Meaning Representation (UMR) for the Italic/Romance languages, project No. 104924, GAUK
