Uniform Meaning Representation (UMR) for Czech

For centuries, linguists have deliberated on how to represent meaning. In recent years, this question has been viewed not only as an intriguing theoretical challenge but also as one with practical implications for various applications: meaning representation can serve as a basis for any system that requires sound and reliable knowledge representation to enable logical inference.

While numerous formalisms for meaning representation have been proposed in recent decades, this project focuses on specific approaches: the meaning representation used in the Prague Dependency Treebank family (PDT) and the Uniform Meaning Representation. The choice of the first formalism is motivated by the availability of data for Czech, particularly the PDT-C treebank. This treebank provides the most comprehensive Czech data (almost 175.5 thousand sentences across different genres) with fine-grained annotation at the tectogrammatical level, capturing linguistically structured meaning. The second approach, Uniform Meaning Representation (UMR), offers significant potential to enhance the PDT-C representation in several key ways:

  • UMR provides a more abstract representation, which is less dependent on a specific language and its structure.
  • UMR anchors concepts within a knowledge base, utilizing resources like English Wikipedia or WikiData.
  • UMR aims to support logical inference, an aspect that lies beyond the scope of the PDT.
  • Furthermore, UMR is being used for a variety of typologically diverse languages, including Chinese, Arapaho, Navajo, Kukama, and Sanapaná. This approach and its rich data may help in understanding some features of Czech from a typological point of view.

Project objective

The primary objective of the project is to explore the feasibility of a (semi-)automatic conversion of the PDT-C data into a format that adheres to the UMR specification. In particular, the project aims to identify:

  • Language phenomena that can be transferred relatively easily and reliably from the available Czech annotation to the UMR structures (e.g., sentence syntactic structure or coreference relations);
  • Phenomena that require specific treatment and detailed analysis but can still be transferred (e.g., modality or negation);
  • Phenomena that are unavailable in PDT-C and thus necessitate new annotations, either through automatic methods (utilizing advanced machine learning techniques) or through manual annotation (e.g., concept anchoring).

Data releases and release notes

UMR 1.0

This data release did not yet include Czech data, but it is listed here for completeness. It is available from the LINDAT/CLARIAH-CZ repository at http://hdl.handle.net/11234/1-5198. It contains all the data annotated by the U.S. team.

UMR 2.0

This data release contains the first version of the Czech conversion and manually prepared Latin data, likewise produced by the ÚFAL MFF UK team. It is also available from the LINDAT/CLARIAH-CZ repository at TBA.

Publications and presentations

Related projects

The development of Czech UMR has been supported by the following projects:

  • Project LUSyD: Language Understanding: from Syntax to Discourse, GAČR EXPRO program, Project No. GX20-16819X
    This project provides the fundamental research on meaning representations in general, including testing various Natural Language Understanding tools, work on discourse, and the foundations of the SynSemClass event-type ontology. From the UMR perspective, it supports a basic understanding of the UMR principles within the broader approach to meaning representations.

  • Project of the large research infrastructure LINDAT/CLARIAH-CZ, project No. LM2023062, MŠMT LRI program
    This project provides the infrastructural support for hosting the data, tools and services developed in the UMR project, as well as related resources. It also serves as the primary distribution repository for the data developed by the U.S. partner.

The UMR for Czech is also related to the following project: