The course focuses on multilingual aspects of natural language processing.
It explains both the issues and the benefits of doing NLP in a multilingual
setting, and presents possible approaches. We will cover both handling
multilingual variety with monolingual methods applied to multiple languages,
and truly multilingual and crosslingual approaches which use resources
in multiple languages at once. We will review and work with a range of freely
available multilingual resources, both plaintext and annotated.
About
Timespace Coordinates
- in the summer semester of 2024, the course takes place every Thursday 14:00-15:30 in the SW1
lab at Malá Strana
- video recordings of classes from previous years are also available
Informal prerequisites
We suggest that students first attend the NPFL100 course
Variability of languages in time and space / Variabilita jazyků v čase a prostoru,
which takes a more theoretical and linguistic look at many phenomena that
we will approach more practically and computationally.
Some basic programming skills are expected, e.g. from the NPFL125 course Introduction to Language Technologies.
The course pairs nicely with the NPFL070 course
Language Data Resources.
Organization of the course
The course has the form of a practical seminar in the computer lab.
In each class we will try to combine a lecture with practical hands-on
exercises (students are therefore required to have a unix lab account).
Lectures
1. Introduction; WALS Slides wals Lecture recording Practicals recording
2. Alphabets, encoding, language identification Slides Lecture recording Practicals recording
3. Tokenization and Word Segmentation Slides tokenization Online class recording
4. Interset, POS harmonization Slides pos_harmonization Online class recording
5. Machine Translation (Ondřej Bojar) Slides mt
6. Cross-lingual POS tagging Slides pos_tagging Online class recording
7. Delexicalized parsing Slides delex_parsing Online class recording (lecture 1:20 to 57:46)
8. Tree projection Tree projection tree_projection Online class recording (lecture 0:49 to 24:35)
9. Treebank translation Treebank translation tree_translation Online class recording (lecture 24:35 to 41:22)
10. Word Embeddings Slides embeddings Online class recording (lecture 0:48 to 49:20)
11. Contextual Word Embeddings Slides bert Online class recording
12. Syntax harmonization and Enhanced Universal Dependencies Slides enhancing_ud
13. Multilingual Machine Translation (Ondřej Bojar) Slides Online class recording The Reality of Multi-Lingual Machine Translation
14. (probably cancelled)
Requirements
Homework tasks
There will be homework from most of the classes, typically based on finishing
and/or extending the exercises from that class.
An important part of each homework assignment is submitting a brief report on
what you did and what you found out. It can be e.g.
just 5 sentences in plaintext, or 5 pages with tables and figures in PDF,
whatever seems appropriate based on what you did.
To pass the course, you will be required to actively participate in the
classes and to submit homework tasks. The quality of your homework
solutions will determine your grade.
Grading rules
You get some points for each homework, typically between 1 and 4. A
standard good solution gets 3 points – a weaker solution gets less, a stronger solution
gets more. Then, if your final average of points per homework is at least 3.0,
you get the grade 1; otherwise you get a lower grade: grade 2 for average
above 2.5; grade 3 for average above 2.0.
No cheating
- Cheating is strictly prohibited and any student found cheating will be punished.
The punishment can involve failing the whole course, or, in grave cases,
being expelled from the faculty.
- Discussing homework assignments with your classmates is OK. Sharing code is
not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
- All students involved in cheating will be punished. E.g. if you share
your assignment with a friend, both you and your friend will be punished.
License
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The plan and deadlines of the assignments are preliminary and may be updated.
Rules
- use any programming language you like (we suggest Python)
- submit
- source codes
- short report
- at least a few sentences, saying what you did, how you did
it, how it worked, what you observed in the results, etc.
- if it makes sense, please also include a sample of the
results/outputs
- it can be a long report if there is a lot to say about what you did
(e.g. 5 pages with tables and figures),
but otherwise a few sentences are sufficient (e.g. 5 sentences in a
plaintext file)
- the report is more important than the source codes: we may or may not
check/run your code, but we will always read the report
- use any reasonable format you like (TXT, PDF, MD, DOC...)
- submit via a Git repository
- create a Git repository somewhere, probably
faculty GitLab
- give read access to the repository to Rudolf and send him the address of
the repository
- the deadline is in ~1.5 weeks by default (in 2024 this means Monday 23:59)
- you will get points
- 3 points is the base for an OK solution
- fewer points for a bad solution, 0 points for no solution
- more points for a great solution, doing something clever, doing more
work, going deeper, finding something good...
- if your point average is at least 3 at the end of the semester, you get the
grade 1
- otherwise you get a lower grade: grade 2 for average above 2.5; grade 3 for average above 2.0.
- feel free to go deeper
- We are sometimes operating at the edge of the current research frontier, so in
any of the assignments there is a chance that you will discover something
new (worth publishing at a scientific conference, or investigating more in a
diploma thesis, etc.)
- So feel free to go as deep as you want in any of the
assignments!
- You can even diverge from the task if you come up with
something more interesting to do. Just follow your imagination :-) because this
is how research is done.
- You will get more points if you do anything beyond the base task (and if
it is extra interesting, we can talk about publishing it in a scientific
paper).
- But also feel free to simply do the assignment as it is set, this will still
give you 3 points. You can do more, but you do not have to.
wals
Deadline: Mar 11
3 points
- WALS online for clicking
- language.tsv
-- WALS dataset for computer processing (free to download in CSV; this
file has been converted to TSV for convenience, but it was generated in
2018 and WALS has been updated in the meantime, so you may want to download
the new original WALS dataset instead)
- grep-ing and cut-ing in the WALS dataset
- Homework: a script for measuring language
similarity using the WALS dataset
- Idea: similarity of a pair of languages can be estimated by comparing their
WALS features, e.g. by counting the
number of WALS features in which they are similar (Agić, 2017).
The simplest way is to iterate over the features, ignoring those that are
undefined for one of the two languages, and adding 1 to the score if the
values match or 0 if they do not match. If you then divide this by the
number of features, you get the Hamming similarity (a minimal sketch appears after this list).
- You can either do the tasks 1-3 (1 is really THE task, 2 and 3 are just simple extensions), or you can do the harder alternative task.
- Task 1: input = WALS code of one language, output = WALS code and
similarity scores for most similar languages.
- Task 2: input = genus (e.g. "Slavic"), output = centroid language of
that genus, i.e. a language most similar to other languages of the genus
- Task 3: find the weirdest language, i.e. most dissimilar to any other
language (for whole WALS, or for a given language genus/family)
- Alternative task: automatically generate missing values in WALS (e.g. if
all Slavic languages have the number of genders either 3 or unspecified,
you can probably set the unspecified values to 3). This is a harder task,
so if you do this one, you do not have to do the tasks 1-3.
- The definition of the task is somewhat vague; feel free to spend as
much or as little time on it as you wish
- Another existing approach: lang2vec
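- A minimal sketch of such a similarity script in Python (assumptions: language.tsv is tab-separated, the WALS code is in the first column, and feature values start after a fixed number of metadata columns -- adjust META_COLS and the filename to the actual file):
#!/usr/bin/env python3
# Hedged sketch: Hamming similarity between languages over shared WALS features.
import csv
import sys

WALS_TSV = "language.tsv"   # assumed filename, see the link above
META_COLS = 10              # assumption: the first columns are metadata (code, name, genus, family, ...)

def load_wals(path):
    langs = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                          # skip the header line
        for row in reader:
            langs[row[0]] = row[META_COLS:]   # WALS code -> feature values ("" = undefined)
    return langs

def hamming_similarity(a, b):
    # compare only the features defined for both languages;
    # one reasonable choice is to divide by the number of compared features
    same = compared = 0
    for x, y in zip(a, b):
        if x and y:
            compared += 1
            same += (x == y)
    return same / compared if compared else 0.0

if __name__ == "__main__":
    target = sys.argv[1]                      # e.g. a WALS code such as "cze"
    langs = load_wals(WALS_TSV)
    scores = sorted(((hamming_similarity(langs[target], feats), code)
                     for code, feats in langs.items() if code != target), reverse=True)
    for score, code in scores[:20]:           # Task 1: the most similar languages
        print(f"{code}\t{score:.3f}")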
tokenization
Deadline: Mar 18
3 points
pos_harmonization
Deadline: Mar 25
3 points
- Tagset harmonization exercise: You get a syntactic parser trained on
the UD tagset (UPOS and Universal Features), and
data tagged with a different tagset. Try to convert the tagset into the UD
tagset to get better results when applying the parser to the data.
- The data in the CoNLLU format and the trained UDPipe models
can be found at
http://ufallab.ms.mff.cuni.cz/~rosa/npfl120/pos_harm/.
- Running the parser
- To run the parser and get results in the CoNLLU
format, use e.g.:
cat ta-ud-dev-orig.conllu | ./udpipe --parse ta.sup.parser.udpipe
- To view the tree structures in the CoNLLU data, you
can use e.g. view_conll
or Udapi.
- To evaluate the parsing accuracy, use e.g.:
cat ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe
- This may be a bit confusing, so to clarify:
- For POS harmonization, we only assume to have the first 6 fields of
the CoNLLU available as input.
- We do not do intrinsic evaluation here by evaluating directly the
correctness of the harmonization.
- Rather, we do extrinsic evaluation, evaluating the success of the
harmonization indirectly by using the harmonized tags as input to a
syntactic parser and observing the parsing accuracy.
- To be able to do that, the data also contains the other two fields (HEAD
and DEPREL), i.e. the syntactic annotation against which we
evaluate the parser.
- Do not use HEAD and DEPREL in your code for the harmonization,
assume that these are not available! I.e. these two columns are sort
of the test data for you.
- The tagset documentation (in practice it is often quite hard
to get proper documentation for a tagset, but we decided to be
nice to you):
- Try to achieve some reasonable parsing accuracy – I guess at
least 50% should be achievable rather easily.
- Note that 100% accuracy is not reachable; the UAS upper bounds (measured
on UD test data) are:
CS 90%,
DE 85%,
EN 88%,
LA 68%,
TA 78%
- Your task is to try to do the harmonization yourself, not
using any pre-existing tools for that.
- Homework:
- Harmonize the tagset for one of the languages.
- You can use the template
harmonize.py (a minimal sketch of such a tag-mapping script appears after this list)
- Turn in the code that you used.
- Report the parsing accuracy before and after your
harmonization (both UAS and LAS); please measure the
accuracy repeatedly during the development and report
which changes to your solution brought which improvements
of the parsing accuracy.
- The minimum is to identify some of the main POS
categories, such as verbs, nouns, adjectives, and adverbs,
so that you get a reasonable parsing accuracy.
For doing that, you can get 2 points for the homework.
You can get more points if you further improve your
solution; some suggestions are listed below.
- You can try to
identify more POS categories; ideally you should map all
of the original POS tags to some UPOS tags.
- (You can try to produce some of the Universal
Features (documentation)
– but this will most probably not work well, as UDPipe uses
the features as one atomic string.)
- You can try to cover all of the languages, at least in a
basic way.
- You can figure out how to use Interset (see
the lecture), use it to harmonize the tagset, and
compare the parsing accuracy achieved when using your
solution and when using Interset (but you still need to create at least a simple
solution of your own).
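- A minimal sketch of what such a tag-mapping script could look like (not the provided harmonize.py; the prefix rules below are made-up placeholders -- replace them with rules derived from the tagset documentation, and check in which column your data stores the original tag):
#!/usr/bin/env python3
# Hedged sketch: map original language-specific tags to UPOS in CoNLL-U data.
# Assumption: the original tag is stored in the XPOS column (index 4); adjust if needed.
import sys

# hypothetical prefix -> UPOS rules; fill in real rules for your tagset
PREFIX_TO_UPOS = {"N": "NOUN", "V": "VERB", "A": "ADJ", "D": "ADV"}

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        print(line)
        continue
    cols = line.split("\t")
    if not cols[0].isdigit():                            # leave multiword-token and empty-node lines alone
        print(line)
        continue
    orig_tag = cols[4]                                   # assumed: original tag in the XPOS column
    cols[3] = PREFIX_TO_UPOS.get(orig_tag[:1], "X")      # harmonized tag goes into the UPOS column
    print("\t".join(cols))
- You could then pipe the harmonized data into the parser as shown above, e.g. cat ta-ud-dev-orig.conllu | python3 harmonize_sketch.py | ./udpipe --parse --accuracy ta.sup.parser.udpipe (harmonize_sketch.py is a hypothetical file name).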
mt
3 points
lab instructions: http://ufal.mff.cuni.cz/~zeman/langtech/npfl120/multiling-04-lab.txt
- Visually compare the left, right and intersection alignments; check in how many sentences you see the 'garbage alignments' that all fall onto one word.
- Compare the intersection alignment for the baseline and improved alignments.
- Write a small script that reads:
- source tokens
- target tokens
- alignment
and emits all pairs of aligned words. If run through sort | uniq -c | sort -n, this would be a translation dictionary. (A minimal sketch appears after this list.)
- Within the homework assignment, it is sufficient to get up to here; the further steps are optional.
- Continue the moses tutorial to train a phrase-based model (apply mert-moses.pl).
- Apply the trained model.
- Compare the translations from the default run and from the run with these model flags:
-dl=0 -max-phrase-length 1
- Some hints
- Jacob: I found a solution for the compilation issue. In case any other students are having trouble with modern compilers, this is the solution that worked for me: set the standard to C++03. I did this via an environment variable so I wouldn't have to edit the build files.
export CFLAGS_GLOBAL="-std=c++03"
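- A minimal sketch of the aligned-word-pair script mentioned above (assumptions: one sentence per line with space-separated tokens, and alignments in the usual 0-based "srcidx-tgtidx" format; the file names are placeholders):
#!/usr/bin/env python3
# Hedged sketch: emit aligned word pairs from tokenized parallel text plus an alignment file.
import sys

src_file, tgt_file, ali_file = sys.argv[1:4]

with open(src_file, encoding="utf-8") as s, \
     open(tgt_file, encoding="utf-8") as t, \
     open(ali_file, encoding="utf-8") as a:
    for src_line, tgt_line, ali_line in zip(s, t, a):
        src = src_line.split()
        tgt = tgt_line.split()
        for pair in ali_line.split():
            i, j = map(int, pair.split("-"))   # assumed: source index first, 0-based
            print(f"{src[i]}\t{tgt[j]}")
- You can then pipe its output through sort | uniq -c | sort -n to get the translation dictionary.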
The homework assignment is voluntary: you can submit it for extra points, but
you do not have to.
pos_tagging
Deadline: Apr 8
3 points
- devise a cross-lingual POS tagger for one under-resourced target language
- start here, finish as homework
- report what you did and your POS tagging accuracy on the UD test data
- suggested target language: Kazakh (kk) / Telugu (te)
- there are some training data in UD, but
let's pretend there are none and just use the test data
- there are some reasonable parallel data
- there is at least one reasonable high-resource source language for each of these to project the POS tags from -- choose the source language(s) yourself
- POS projection over (multi)parallel data
- take parallel data -- I suggest Watchtower and/or OpenSubtitles from OPUS:
- Watchtower (do not share) by Agić+ (2016)
- OPUS by Tiedemann and Nygaard (2004)
- WTC data is in a multiparallel format
- the same line in all the files corresponds to the same sentence in the various languages
- but some lines may be empty, as not all sentences are present in all the files
- some OPUS data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment
- so if you use multiple sources at once, I suggest you use WTC
- POS tag the source side of the parallel data
- you can use the trained UDPipe models
- tokenize and tag with UDPipe:
udpipe --tokenize --tag path/to/model < input.txt > output.conllu
- or tag an already tokenized text:
udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu
- or to only convert tokenized text to CONLLU format:
udpipe --input=horizontal path/to/model < input.txt > output.conllu
- word-align source and target
- you can use Giza++ or efmaral or FastAlign (see below)
- important: you need to have cmake for the installation of FastAlign, so if you don't have it, get it at https://cmake.org/download/ (or see the FastAlign website for instructions)
- FastAlign installation:
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
cmake ..
make
- I suggest using intersection alignment symmetrization, but you can play with this a bit
- FastAlign usage (add -s to also output alignment scores):
paste cs sk | sed 's/\t/ ||| /' | grep '. ||| .' > cs-sk
fast_align -d -o -v -i cs-sk > cs-sk.f
fast_align -d -o -v -r -i cs-sk > cs-sk.r
atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i
- project POS tags through the alignment from the tagged source to the non-tagged target (a minimal sketch of such a projection appears after this list)
- you can use the template pos_project.py (but it was created for a slightly different purpose, so you may need to change it a bit or a lot)
- take inspiration from the lecture to do the projection
- simply copying the POS tag from source to target with no other
tricks is sufficient to get 2 points for the assignment
- you still need to do something with unaligned words or
multiply aligned words (e.g. voting or weighted voting, or
simply use the knowledge that NOUN is usually the most
frequent POS...)
- doing something more clever carries more points
- ideally start with the simple solution, measure the base accuracy,
then implement some improvements, and repeatedly measure the
increase in accuracy (if any)
- train a tagger on the target data:
udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu
- evaluate the tagger on target test data:
udpipe --tag --accuracy path/to/model < test.conllu
- other notes (not important for this HW)
- you can use the HunAlign sentence aligner if you use parallel data that are not
sentence-aligned: install_hunalign.sh, hun_align.sh
- some data in Opus are weird; OpenSubtitles and Tanzil are nice
- once you have word-aligned data, you can also extract a simple
word-to-word translation dictionary (this single-best translation is weaker
than e.g. Moses as it does not take the context into account)
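- A minimal sketch of the simple projection (independent of the pos_project.py template; assumptions: the source side is tagged CoNLL-U, the target side is tokenized plain text, the alignment is 0-based "srcidx-tgtidx" pairs with the source index first, and the three files are sentence-parallel; unaligned target words fall back to NOUN):
#!/usr/bin/env python3
# Hedged sketch: project UPOS tags from a tagged source (CoNLL-U) to a tokenized
# target text through word alignment, writing the target side as CoNLL-U.
import sys

src_conllu, tgt_text, ali_file = sys.argv[1:4]

def read_sentences(path):
    # yield one sentence at a time as a list of (FORM, UPOS) pairs
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():          # skip multiword tokens (1-2) and empty nodes (8.1)
                    sent.append((cols[1], cols[3]))
    if sent:
        yield sent

with open(tgt_text, encoding="utf-8") as t, open(ali_file, encoding="utf-8") as a:
    for src_sent, tgt_line, ali_line in zip(read_sentences(src_conllu), t, a):
        tgt_tokens = tgt_line.split()
        tags = ["NOUN"] * len(tgt_tokens)      # fallback for unaligned target tokens
        for pair in ali_line.split():
            i, j = map(int, pair.split("-"))   # assumed: source index first, 0-based
            if i < len(src_sent) and j < len(tgt_tokens):
                tags[j] = src_sent[i][1]       # copy the UPOS of the aligned source word
        for idx, (form, tag) in enumerate(zip(tgt_tokens, tags), 1):
            # HEAD/DEPREL are left unfilled; tagger training ignores the syntax
            print("\t".join([str(idx), form, "_", tag, "_", "_", "_", "_", "_", "_"]))
        print()
- You can then train and evaluate the UDPipe tagger on the resulting file as shown above.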
delex_parsing
Deadline: Apr 15
3 points
- applying lexicalized versus delexicalized parsers in a monolingual and cross-lingual setting
- trained lexicalized ("sup") and delexicalized ("delex") UDPipe 1.2 models trained on UD 2.1 treebanks
- language groups for experimenting:
- Norwegian (no), Danish (da), Swedish (sv)
- Czech (cs), Slovak (sk)
- Spanish (es), Portuguese (pt)
- UD treebanks
- evaluating a trained UDPipe parser on test treebank data (only parsing, no tagging!):
udpipe --parse --accuracy path/to/model < test.conllu
- training a delexicalized UDPipe parser (without morpho features); the last parameter (cs.delex.parser.udpipe) is an output parameter, i.e. udpipe will create this file and store the model in it:
cat cs-ud-train.conllu | ./udpipe --train --parser='embedding_form=0;embedding_feats=0;' --tokenizer=none --tagger=none cs.delex.parser.udpipe
- Homework:
- Extend your cross-lingual POS tagging homework to cross-lingual parsing
- Train a delexicalized parser on a source language treebank, and apply it to your cross-lingually-POS-tagged target-language data
- Report the parsing accuracies you obtain (LAS and UAS)
- You may also try the source-lexicalized parsing:
- Train a standard lexicalized parser (but still without morpho features) on the source language
- Apply it to the target language (without any translation)
- This will only work well if there is a substantial amount of shared
vocabulary between the source and the target language,
i.e. they are lexically very close
- Other notes -- combining multiple parsers via the MST algorithm (you do not
have to do this in this HW):
- parse a sentence with multiple parsers -- you get multiple parse trees,
i.e. 3 sets of dependency edges if you used 3 parsers
- assign weights to the edges (e.g. 1
if the edge appeared in one parser output, 2 if in 2, etc.; or
incorporate language similarity into the weights as well, i.e. edges
from less similar languages get a lower weight)
- give the list of edges and their weights to an MST
algorithm, which outputs the best tree that can be constructed
from the edges
- you can use my Perl wrapper of the Perl
Graph::ChuLiuEdmonds
library (or look at my code and use the library directly from your
Perl code; unfortunately I am unaware of any good implementation
of a directed MST algorithm in Python)
- my wrapper takes standard input (one sentence per line) and writes to standard output (one sentence per line)
- the input format is
number_of_nodes parent child weight parent child weight...
where parent and child are 1-based integer IDs of the parent and child nodes of the edge and the weight is a weight you assign to the edge, so e.g.:
3 0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2
(a sketch of producing this format by edge voting appears after this list)
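- A minimal sketch of building the wrapper's input line from several parses of one sentence by simple edge voting (assumption: each parse is given as a list of head IDs, where heads[k] is the head, 0 = root, of the (k+1)-th word):
#!/usr/bin/env python3
# Hedged sketch: combine parses from multiple parsers into one weighted edge list.
from collections import Counter

def mst_input_line(parses):
    n = len(parses[0])                     # number of nodes (words) in the sentence
    votes = Counter()
    for heads in parses:
        for child, head in enumerate(heads, start=1):
            votes[(head, child)] += 1      # one vote per parser proposing this edge
    parts = [str(n)]
    for (head, child), weight in votes.items():
        parts += [str(head), str(child), str(weight)]
    return " ".join(parts)

# toy example with three hypothetical parses of a 3-word sentence
print(mst_input_line([[2, 0, 2], [2, 0, 2], [0, 1, 2]]))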
tree_projection
Deadline: Apr 22
3 points
- Projecting trees over parallel data:
- all
data is here:
PUD = parallel treebanks, align = alignments by FastAlign
- Beware: CONLL-U token IDs are 1-based, FastAlign token IDs are
0-based
- Beware: tokens with non-integer ID (like 5-6 or 8.1) are not part of
the tree nor of the alignment (so maybe you can just grep them
away)
- Beware: forms and lemmas can contain spaces in CONLL-U
- You can use the template project.py which I prepared (it does
the reading in and writing out)
- because this is a parallel treebank, you have gold standard
annotation for both the source tree and the target tree, so you
can measure the accuracy of your projection (in real life, you have
parallel data which do not have any annotation, so you need to parse
the source data with a parser, and then train a parser on the target
data)
- you can use e.g. my evaluator.py for that
- use it e.g. as
python3 evaluator.py -j -m head gold.conllu pred.conllu
- run it as python3 evaluator.py -h for more info; most importantly, you can also use -m deprel or -m las
- Homework:
- implement the projections somehow (a minimal sketch appears after this list)
- try to ensure that what you produce is a rooted tree (only one
root, all nodes have a head assigned, no cycles); report how you did
this and if you succeeded
- also predict deprels somehow (projecting is good, looking at POS
can also be good)
- evaluate your solution automatically for several language pairs and report the scores
- ideally, also compare the accuracies to the delex parsing approach
- report what you found out
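- A minimal sketch of direct edge projection (assumptions: the source tree is a list of 1-based head IDs with 0 for the root, and the alignment is a dict from 0-based source indices to 0-based target indices, matching the 1-based CoNLL-U vs. 0-based FastAlign indexing noted above; making the result a properly rooted tree, as the homework asks, is left to you):
#!/usr/bin/env python3
# Hedged sketch: project dependency edges from a source tree to the target side
# through word alignment (direct projection, no further heuristics).

def project_tree(src_heads, alignment, tgt_len):
    """Return a list of target heads (0 = root); unreached tokens default to 0."""
    tgt_heads = [0] * tgt_len                      # fallback: attach to the root
    for src_dep0, src_head1 in enumerate(src_heads):
        if src_dep0 not in alignment:
            continue                               # unaligned source dependent
        tgt_dep0 = alignment[src_dep0]
        if src_head1 == 0:
            tgt_heads[tgt_dep0] = 0                # the root stays the root
        elif (src_head1 - 1) in alignment:
            tgt_heads[tgt_dep0] = alignment[src_head1 - 1] + 1   # back to a 1-based ID
    return tgt_heads

# toy example: 3-word source tree, word 2 is the root, 1:1 hypothetical alignment
src_heads = [2, 0, 2]
alignment = {0: 0, 1: 1, 2: 2}
print(project_tree(src_heads, alignment, 3))       # -> [2, 0, 2]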
tree_translation
Deadline: Apr 29
3 points
- Lab: cross-lingual parsing lexicalized by translation of the training
treebank using machine translation
- we get back to the VarDial 2017 cross-lingual parsing shared task setup: 3 language
pairs (one is actually a triplet), using supervised POS tags:
- Czech (cs) source, Slovak (sk) target
- Slovene (sl) source, Croatian (hr) target
- Danish (da) and/or Swedish (sv) source, Norwegian (no) target
- choose any language pair you want, or use other languages if you prefer
- for the language pairs above, some datasets
are prepared for the lab (but that's only a minor convenience, you can
simply use UD treebanks and e.g. OpenSubtitles or WatchTower parallel data
for any languages)
- "treebanks" are the training treebanks for the source
languages and evaluation treebanks for the target
languages
- "smaller_delex_models" are the baselines, i.e.
delexicalized UDPipe parsers trained on the first 4096
sentences from the training treebanks; apply them to the
target evaluation treebanks to measure the baseline
accuracy (around 55 LAS I think)
- "our_vardial_models" are lexicalized parsing models
which we submitted into the competition, about +5 LAS
above the baselines (can you beat us?! :-))
- "para" are parallel data, obtained from
OpenSubtitles2016 aligned by MonolingualGreedyAligner with
intersection symmetrization (the format of the data is
"sourceword[tab]targetword" on each line); there are also
"tag" variants where POS tag and morphological features
are annotated for the source word
- "translate_treebank.py" is a simple implementation of
treebank translation which you can use for your
inspiration
- the baseline approach is to translate each word form in the
source treebank (second column) by its most frequent target
counterpart from the parallel data (as done by the sample
"translate_treebank.py" script; a minimal sketch of this baseline appears after this list),
and then train a standard UDPipe parser on that:
udpipe --train --tokenizer=none --tagger=none out.model < train.conllu
and evaluate the parser on the target evaluation treebank:
udpipe --parse --accuracy out.model < dev.conllu
- there are many possible improvements to the approach:
- use better word alignment (e.g. FastAlign intersection
alignment)
- use the source POS tags and/or morphological features
for source-side disambiguation -- e.g. the word "stát" in
Czech should be translated differently as a noun ("state")
and as a verb ("stand"); you already have this annotation
in the source treebank, and you can get it in the parallel
data using a UDPipe tagger trained on the source treebank
(which is how we produced the "tag" variants of the para
data, which you can use)
- use multiple source languages -- either combine the
parsers using the MST algorithm, or simply concatenate the
source treebanks into one (that's what we did in VarDial
for Danish and Swedish -- if you see the "ds" language
code, this means just that)
- use a proper MT system (word-based Moses probably?)
- use your knowledge of the target language for some
additional processing
- guess some translations for unknown words
- pre-train target language word embeddings with
word2vec (on some target language plaintext -- you can
also use the target side of the parallel data) and provide
the pre-trained embeddings to UDPipe in training; see the
UDPipe manual;
another good option is to download pre-trained FastText word embeddings from
fasttext.cc
(use the text format, this is what UDPipe can read in)
- etc., you can have your own ideas for improvements
- Homework:
- implement cross-lingual parsing lexicalized by treebank
translation (it is sufficient to use one language pair, either one
of the above or your own)
- describe what you did and report achieved LAS scores evaluated
on the target language treebank
- doing the simplest baseline lexicalization approach described
above carries 2 points
- implementing some of the improvements carries more points
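- A minimal sketch of the baseline word-for-word treebank translation described above (not the provided translate_treebank.py; assumptions: the parallel data has one "sourceword<TAB>targetword" pair per line, and words unseen in the dictionary are kept untranslated):
#!/usr/bin/env python3
# Hedged sketch: replace each source word form in a CoNLL-U treebank by its most
# frequent target counterpart seen in the parallel word-pair data.
import sys
from collections import Counter, defaultdict

para_file, treebank_file = sys.argv[1:3]

# count target translations for each source word
counts = defaultdict(Counter)
with open(para_file, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            counts[parts[0]][parts[1]] += 1

# keep only the most frequent translation per source word
best = {src: tgt_counts.most_common(1)[0][0] for src, tgt_counts in counts.items()}

with open(treebank_file, encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line and not line.startswith("#") and len(cols) >= 2:
            cols[1] = best.get(cols[1], cols[1])   # translate the FORM column
        print("\t".join(cols))
- The translated treebank can then be fed to udpipe --train as shown above.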
embeddings
Deadline: May 6
3 points
- some further resources on cross-lingual mapping of embedding spaces:
- The Facebook research group (Lample et al):
- The Basque research group (Artetxe et al; Artetxe himself now at Facebook AI):
- A paper on isomorphism assumption (spoiler: typological differences do not seem to lead to non-isomorphism):
- monolingual word embeddings: https://fasttext.cc/
- download and install fastText (it is sufficient to make it)
- download and gunzip a model
- Models for 157 languages
- download both the bin format and the text format (bin is for the fasttext tools, text for any other usage)
- run fasttext to see all the available options
fasttext nn cc.en.300.bin
- input e.g. "dog" to see words most similar to a dog
fasttext analogies cc.en.300.bin
- input e.g. "teacher school hospital" (which means "teacher - school + hospital") to see what happens when you replace the "schoolness" of a teacher by "hospitalness"
- embeddings visualisation: https://projector.tensorflow.org/
- optional:
- bilingual embeddings: https://github.com/artetxem/vecmap
- homework assignment: cross-lingual POS tagging or parsing with bilingual word embeddings
- choose a source language and a target language
- get word embeddings for both the languages
- download them in text format from FastText website
- (or train them yourself using e.g. Wikipedia texts if you want to)
- perform a cross-lingual mapping of the embeddings with VecMap (or another
tool if you want)
- if the embeddings files are large, you can just take e.g. the first
100,000 lines from each embeddings file (these will be the 100,000 most
frequent words)
- you can improve the results by using a bilingual dictionary and the supervised or semi-supervised setting
- you can extract a bilingual dictionary from parallel data --
e.g. take intersection alignment and construct a
dictionary from all aligned pairs of words
- or you can download dictionaries from OPUS: https://opus.nlpl.eu/
(search for a language pair, and look for a "dic" link in the "dic"
column)
- or you can use the identical or unsupervised setting
- note that the supervised variant runs much faster (~2 minutes) than the other options (~5 hours unless you have a GPU)
- note that VecMap requires single word to single word dictionaries, so if
you have phrases (multiple words corresponding to something) you have to
get rid of these somehow
- create one bilingual embeddings file
- from VecMap, you will get two new embedding files, one for the source language and
one for the target language, which contain similar vectors for
similar words across languages
- we need to have one mapping of words in both languages to a common
space, to give to UDPipe to use as the representation of words
- (or we would need to train UDPipe with the crosslingual embeddings for
the source language and then exchange its embedding vocabulary for the
target language crosslingual embeddings, which may or may not be
possible)
- so we can just concatenate the two files to get bilingual embeddings (ideally, we would be a little
clever with the header line and duplicates; see the sketch after this list)
- if you want to do parsing
- if you want to do POS tagging
- train a simple MLP classifier to predict UPOS from embeddings
- note that these embeddings are static so they cannot handle the fact
that some words might pertain to different POS based on context
- you could do some sequence labelling to take in also the context, but this is not at all required for this assignment
- look at the next class for instructions on training an MLP classifier
- evaluate the trained model
- evaluate the model on a test treebank for both the source and the target
language
- compare with some meaningful alternatives (e.g. for parsing: delex parser, projected
parser, supervised parser...)
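- A minimal sketch of concatenating the two mapped embedding files into one bilingual file (assumptions: both files are in the fastText text format with a "word_count dimension" header line and the same dimension; for duplicate words, the vector from the first file wins):
#!/usr/bin/env python3
# Hedged sketch: merge two fastText text-format embedding files into one bilingual file.
import sys

src_emb, tgt_emb, out_path = sys.argv[1:4]

seen = set()
lines = []
dim = None
for path in (src_emb, tgt_emb):
    with open(path, encoding="utf-8") as f:
        header = next(f).split()
        dim = dim or header[1]                    # keep the dimension from the first file
        for line in f:
            word = line.split(" ", 1)[0]
            if word not in seen:                  # skip duplicates across the two files
                seen.add(word)
                lines.append(line)

with open(out_path, "w", encoding="utf-8") as out:
    out.write(f"{len(lines)} {dim}\n")            # rewrite the header with the new word count
    out.writelines(lines)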
bert
Deadline: May 13
3 points
- BERT and mBERT by Google: https://github.com/google-research/bert
- The HuggingFace community makes neural NLP easy to use: https://huggingface.co/
- I will work with BERT and mBERT
- but feel free to use other models, e.g. DistilBERT (and multilingual DistilmBERT), which are smaller, faster, and lighter
- Install HuggingFace Transformers (there is a detailed guide on their website; you need to first install TensorFlow or PyTorch)
# virtual environment
python3 -m venv venv
source venv/bin/activate
# install transformers and torch
pip install transformers torch
- Get contextual word embeddings!
# Imports
from transformers import BertModel, BertTokenizer
import torch
# Loads the model (downloads it if not yet downloaded)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Some valid options:
# bert-base-uncased
# bert-base-cased
# bert-large-cased
# bert-base-multilingual-uncased
# bert-base-multilingual-cased
# Input
sentence = "A platypus is a mammal."
# Tokenize and convert to token ids as pytorch tensor
ids = tokenizer.encode(sentence, return_tensors='pt')
# Let's see the tokens
# Note the special initial and final token
# [0] because technically this is a torch batch of 1 sentence
bert_tokens = tokenizer.convert_ids_to_tokens(ids[0])
# Run the BERT model
output = model(ids)
# See the contextual embedding of the first "A" word
# output[0] is the last encoder layer output
# output[0][0] for the first sentence
# output[0][0][0] is the [CLS] token
# output[0][0][1] is the "A" token
emb_a = output[0][0][1]
emb_mammal = output[0][0][-3]
# Measure cosine similarity of contextual embeddings
from scipy import spatial
def cos_sim(emb1, emb2):
    return 1 - spatial.distance.cosine(emb1.detach(), emb2.detach())
# measure cosine similarity of instances of "a"
cos_sim(output[0][0][1], output[0][0][-4])
# Tip: measure cosine similarity of "dog" and "pes" in mBERT?
# Tip: compute sentence representation as sum of tokens?
- Let's train a simple BERT-based tagger!
- Take some treebanks, e.g. PUD; we need English, and e.g. Czech
- Get BERT contextual embeddings for tokens in the treebank using connlu2vectors.py
# Load BERT, read CoNLL-U, for each token write UPOS and contextual embedding
# Does something too stupid to join wordpieces into tokens
# Skips sentences where this fails
./connlu2vectors.py bert-base-uncased < en_pud-ud-test.conllu > en_pud.bert
- Train a simple MLP classifier to predict UPOS from contextual embeddings using train_mlp.py
# Read data, split into train and test,
# train MLP classifier, report accuracies, save model
./train_mlp.py en_pud.bert.model < en_pud.bert
- Apply the classifier to English data, as well as to e.g. Czech data, using eval_mlp.py
# Evaluating on training data, basically...
./eval_mlp.py en_pud.bert.model < en_pud.bert
# Using monolingual (English) BERT on Czech: not good
./eval_mlp.py en_pud.bert.model < cs_pud.bert
- Do the same but using multilingual mBERT instead of monolingual BERT
./connlu2vectors.py bert-base-multilingual-uncased < en_pud-ud-test.conllu > en_pud.mbert
./connlu2vectors.py bert-base-multilingual-uncased < cs_pud-ud-test.conllu > cs_pud.mbert
./train_mlp.py en_pud.mbert.model < en_pud.mbert
./eval_mlp.py en_pud.mbert.model < cs_pud.mbert
- Now the tagger trained on English magically works also for Czech!
- If there is time: briefly try playing with ChatGPT for multilingual and crosslingual tasks
- Good if you can formulate your task as a text-based task in plain
language (e.g. "Is this sentence positive or negative or neutral?") --
LLMs are trained for next word generation and typically also tuned for
textual instructions
- SotA for machine translation for some language pairs (especially when
translating into English)
- Worth trying if you need to use some specialized terminology (e.g.
"Which token is the subject of this sentence: 'On Monday John is walking the dog.'"), but typically better results if
you can formulate your task in plain language (e.g. "Who is doing the
activity, Monday or John?")
- Nearly useless for complex tasks involving specialized terminology and
special formats (e.g. "Give me the CONLL-U analysis of this sentence
according to UD 2.0") -- better to split that up into simpler tasks
using more plain-language text-based formulations
- Homework assignment
- try to do cross-lingual POS tagging with mBERT (a minimal sketch of the training/evaluation step appears after this list)
- compare several setups
- you can choose a fixed target language and vary the source languages
- you can also combine (concatenate) multiple source languages
- you can try various target languages
- you can try to improve the tokenization mismatch problem
- you can compare mBERT to vecmap
- you can play with the classifier setup
- (if you have access to a GPU, you can also try fine-tuning the mBERT
model; but this is very computationally demanding and can take a lot of
time, so probably you should not attempt this for the homework assignment)
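- A minimal sketch in the spirit of the train_mlp.py / eval_mlp.py scripts above (the provided scripts may differ): train an MLP on (m)BERT token embeddings for a source language and evaluate it on a target language; the assumed data format is one token per line, the UPOS tag first, followed by the embedding values, all whitespace-separated (adapt the loader to the actual connlu2vectors.py output):
#!/usr/bin/env python3
# Hedged sketch: MLP classifier mapping contextual token embeddings to UPOS tags.
import sys
import numpy as np
from sklearn.neural_network import MLPClassifier

def load(path):
    labels, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) > 1:
                labels.append(parts[0])
                vectors.append([float(x) for x in parts[1:]])
    return np.array(vectors), labels

train_path, test_path = sys.argv[1:3]      # e.g. en_pud.mbert cs_pud.mbert

X_train, y_train = load(train_path)
X_test, y_test = load(test_path)

# a small MLP is enough to map contextual embeddings to UPOS
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=50)
clf.fit(X_train, y_train)

print("source-language accuracy:", clf.score(X_train, y_train))
print("target-language accuracy:", clf.score(X_test, y_test))
- Varying which source-language file you train on and which target-language file you evaluate on gives the comparisons asked for above.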
Conclusion
- So can we forget everything we learned in previous classes and just use mBERT?
- Probably in most multilingual and cross-lingual situations, mBERT should
be the tool to use.
- mBERT only covers 104 languages; for languages that are not covered, it
might work somewhat or not at all
- Generative multilingual LLMs (GPT-4, Mixtral, etc) are worth looking at
- Again, only some languages are covered, and smaller languages are
covered worse than large languages (according to data size, not speaker
numbers!)
- Good if you can formulate your task as text generation, much worse if
you need to work with some structured data
- Not very good for getting contextual embeddings (mBERT typically better)
- Many of the problems are still there (alphabets, tokenization, language
similarity...)
- Some approaches are somewhat outdated for the task for which we showed them (e.g.
delexicalized parsing) but are useful concepts often used elsewhere (e.g.
delexicalization in dialogue systems)
- In the course, you should have learned both general transferable stuff and
specific practical stuff
- Individual tools and methods change fast; you should know and be able to
use some of the current tools and methods, but you need to keep learning
- General ideas and approaches transfer; you should be able to apply your
understanding of the problem area even with new tools and methods
- Language properties stay; we may have new tools to solve the old problems,
but the problems themselves will not go away
enhancing_ud
Deadline: May 15
3 extra points
This is a voluntary assignment, you can do it to gain extra points but you do not have to.
- We are spending some time with syntax annotation harmonization (as we
have only covered morphological annotation harmonization so far), and with
Enhanced UD.
- Lab: enhancing Czech UD treebank with information from the
tectogrammatical (deep syntactic) annotation in PDT
- Voluntary homework: try to enhance the Czech UD with
something
- Choose any of the phenomena listed in Enhanced UD, and try to
enrich the Czech UD annotation with it
- You may add abstract nodes (with non-integer IDs such as 7.1)
and/or add secondary dependencies (these go into the DEPS (9th)
column, see CoNLL-U
format)
- You may use the tecto annotation; there are the same sentences
in the tecto file as in the ud file, and the enhance.py script takes care
of loading the corresponding pairs of sentences; the most useful columns
are probably the ID, the COREF_IDS (IDs of coreference antecedents,
separated by pipes "|"), and the EFFHEADS (effective heads, such as the
"real" head for all conjuncts, or even multiple heads e.g. for shared
modifiers)
- For some phenomena the tecto annotation is probably not needed
- You can also try to work with a different language if you
decide to focus on something where you don't need the tecto
annotation
- It is not always clear how to do the enhanced UD, and also the
tecto annotation is often quite complex, so don't worry if you get
lost or confused -- just try to do something, and then submit the
code and a commentary on what you tried to do, how you did it, and how much
you think you succeeded...