Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: PRIMUS/23/SCI/023
ÚFAL budget: 9158000
Duration: 2023-2026

Language Neutral and Culturally Aware Multilingual Neural Sentence Representations

Recently, multilingual sentence representations have made it possible to represent many languages in a single model and thus to transfer task-specific models between languages in a zero-shot manner. These methods could revolutionize computational linguistics and natural language processing (NLP) by unifying the processing of all languages in a single framework. Yet the level of cross-lingual alignment in current models is not sufficient for that.

We believe two points were neglected in previous work. First, theoretical work suggests that physical perception might help to ground meaning and, eventually, improve the language neutrality of multilingual representations. Second, language meaning is socially constructed and inseparable from culture, which sets inherent limits on language neutrality. Multilingual representations must therefore be aware of the cultural dimension of meaning, which should be interpretable and controllable.

In this project, we tackle these two issues of multilingual representation. As a result, we want to make NLP models available in many languages without the need for explicit translation or task-specific data in multiple languages.

Publications

  1. Adnan Al Ali, Jindřich Libovický (2024): How Gender Interacts with Political Values: A Case Study on Czech BERT Models. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 3038-3045, European Language Resources Association, Torino, Italy, ISBN 978-2-493814-10-4 (local PDF, bibtex)
  2. Katharina Hämmerl, Jindřich Libovický, Alexander Fraser (2024): Understanding Cross-Lingual Alignment—A Survey. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 10922-10943, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-099-8 (url, local PDF, bibtex)
  3. Katharina Hämmerl, Andrei Alexandru Manea, Gianluca Vico, Jindřich Helcl, Jindřich Libovický (2024): CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. In: Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pp. 357-364, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-184-1 (url, bibtex)
  4. Jindřich Helcl, Zdeněk Kasner, Ondřej Dušek, Tomasz Limisiewicz, Dominik Macháček, Tomáš Musil, Jindřich Libovický (2024): Teaching LLMs at Charles University: Assignments and Activities. In: The Sixth Workshop on Teaching NLP: Proceedings of the Workshop, pp. 69-72, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-134-6 (url, local PDF, bibtex)
  5. Jindřich Libovický, Jindřich Helcl (2024): Lexically Grounded Subword Segmentation. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7403-7420, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-164-3 (url, bibtex)
  6. Philipp Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický (2024): Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples. In: The 3rd Workshop on Advances in Language and Vision Research: Proceedings of the Workshop, pp. 102-115, Association for Computational Linguistics (ACL), Kerrville, TX, USA, ISBN 979-8-89176-153-7 (pdf, local PDF, bibtex)
  7. Katharina Hämmerl, Björn Dieseroth, Patrick Schramowski, Jindřich Libovický, Constantin A. Rothkopf, Alexander Fraser, Kristian Kersting (2023): Speaking Multiple Languages Affects the Moral Bias of Language Models. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 2137-2156, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-959429-62-3 (url, bibtex)
  8. Katharina Hämmerl, Alina Fastowski, Jindřich Libovický, Alexander Fraser (2023): Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 7023-7037, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-959429-62-3 (url, bibtex)
  9. Jindřich Helcl, Jindřich Libovický (2023): CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval. In: Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL), pp. 302-309, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 979-8-89176-056-1 (pdf, local PDF, bibtex)
  10. Hynek Kydlíček, Jindřich Libovický (2023): A Dataset and Strong Baselines for Classification of Czech News Texts. In: 26th International Conference, TSD 2023, pp. 33-44, Springer, Cham, Switzerland, ISBN 978-3-031-40497-9 (url, bibtex)
  11. Jindřich Libovický (2023): Is a Prestigious Job the same as a Prestigious Country? A Case Study on Multilingual Sentence Embeddings and European Countries. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1000-1010, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-955917-71-1 (pdf, local PDF, bibtex)