The course focuses on multilingual aspects of natural language processing.
It explains both the issues and the benefits of doing NLP in a multilingual
setting, and presents possible approaches. We will cover both handling
multilingual variety with monolingual methods applied to multiple languages,
and truly multilingual and crosslingual approaches which use resources
in multiple languages at once. We will review and work with a range of freely
available multilingual resources, both plaintext and annotated.
About
Timespace Coordinates
- in the summer semester of 2024, the course takes place every Thursday 14:00-15:30 in the SW1
lab at Malá Strana
- video recordings of classes from previous years are also available
Informal prerequisites
We suggest that students first attend the NPFL100 course
Variability of languages in time and space / Variabilita jazyků v čase a prostoru,
which takes a more theoretical and linguistic look at many phenomena that
we will approach more practically and computationally.
Some basic programming skills are expected, e.g. from the NPFL125 course Introduction to Language Technologies.
The course pairs nicely with the NPFL070 course
Language Data Resources.
Organization of the course
The course has the form of a practical seminar in the computer lab.
In each class we will try to combine a lecture with practical hands-on
exercises (students are therefore required to have a unix lab account).
Lectures
1. Introduction; WALS Slides wals Lecture recording Practicals recording
2. Alphabets, encoding, language identification Slides Lecture recording Practicals recording
3. Tokenization and Word Segmentation Slides tokenization Online class recording
4. Interset, POS harmonization Slides pos_harmonization Online class recording
5. Machine Translation (Ondřej Bojar) Slides mt
6. Cross-lingual POS tagging Slides pos_tagging Online class recording
7. Delexicalized parsing Slides delex_parsing Online class recording (lecture 1:20 to 57:46)
8. Tree projection Tree projection tree_projection Online class recording (lecture 0:49 to 24:35)
9. Treebank translation Treebank translation tree_translation Online class recording (lecture 24:35 to 41:22)
10. Word Embeddings Slides embeddings Online class recording (lecture 0:48 to 49:20)
11. Contextual Word Embeddings Slides bert Online class recording
12. Syntax harmonization and Enhanced Universal Dependencies Slides enhancing_ud
13. Multilingual Machine Translation (Ondřej Bojar) Slides Online class recording The Reality of Multi-Lingual Machine Translation
14. (probably cancelled)
Requirements
Homework tasks
There will be homework from most of the classes, typically based on finishing
and/or extending the exercises from that class.
An important part of each homework assignment is submitting a brief report on
what you did and what you found out. It can be e.g.
just 5 sentences in plaintext, or 5 pages with tables and figures in PDF,
whatever seems appropriate based on what you did.
To pass the course, you will be required to actively participate in the
classes and to submit homework tasks. The quality of your homework
solutions will determine your grade.
Grading rules
You get some points for each homework, typically between 1 and 4. A
standard good solution gets 3 points – a weaker solution gets less, a stronger solution
gets more. Then, if your final average of points per homework is at least 3.0,
you get the grade 1; otherwise you get a lower grade: grade 2 for average
above 2.5; grade 3 for average above 2.0.
No cheating
- Cheating is strictly prohibited and any student found cheating will be punished.
The punishment can involve failing the whole course, or, in grave cases,
being expelled from the faculty.
- Discussing homework assignments with your classmates is OK. Sharing code is
not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
- All students involved in cheating will be punished. E.g. if you share
your assignment with a friend, both you and your friend will be punished.
License
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The plan and deadlines of the assignments are preliminary and may be updated.
Rules
- use any programming language you like (we suggest Python)
- submit
- source codes
- short report
- at least a few sentences, saying what you did, how you did
it, how it worked, what you observed in the results, etc.
- if it makes sense, please also include a sample of the
results/outputs
- it can be a long report if there is a lot to say about what you did
(e.g. 5 pages with tables and figures),
but otherwise a few sentences are sufficient (e.g. 5 sentences in a
plaintext file)
- the report is more important than the source codes: we may or may not
check/run your code, but we will always read the report
- use any reasonable format you like (TXT, PDF, MD, DOC...)
- submit via a Git repository
- create a Git repository somewhere, probably
faculty GitLab
- give read access to the repository to Rudolf and send him the address of
the repository
- the deadline is in ~1.5 weeks by default (in 2024 this means Monday 23:59)
- you will get points
- 3 points is the base for an OK solution
- fewer points for a bad solution, 0 points for no solution
- more points for a great solution, doing something clever, doing more
work, going deeper, finding something good...
- if your point average is at least 3 at the end of the semester, you get the
grade 1
- otherwise you get a lower grade: grade 2 for average above 2.5; grade 3 for average above 2.0.
- feel free to go deeper
- We are sometimes operating at the edge of the current research frontier, so in
any of the assignments there is a chance that you will discover something
new (worth publishing at a scientific conference, or investigating more in a
diploma thesis, etc.)
- So feel free to go as deep as you want in any of the
assignments!
- You can even diverge from the task if you come up with
something more interesting to do. Just follow your imagination :-) because this
is how research is done.
- You will get more points if you do anything beyond the base task (and if
it is extra interesting, we can talk about publishing it in a scientific
paper).
- But also feel free to simply do the assignment as it is set, this will still
give you 3 points. You can do more, but you do not have to.
wals
Deadline: Mar 11
3 points
- WALS online for clicking
- language.tsv
-- WALS dataset for computer processing (free to download in CSV; this
file has been converted to TSV for convenience, but it was generated in
2018 and WALS has been updated in the meantime, so you may want to download
the new original WALS dataset instead)
- grep-ing and cut-ing in the WALS dataset
- Homework: a script for measuring language
similarity using the WALS dataset
- Idea: similarity of a pair of languages can be estimated by comparing their
WALS features, e.g. by counting the
number of WALS features in which they are similar (Agić, 2017).
The simplest way is to iterate over the features, ignoring those that are
undefined for one of the two languages, and adding 1 to the score if the
values match or 0 if they do not match. If you then divide this by the
number of features, you get the Hamming similarity (a minimal sketch appears after this list).
- You can either do the tasks 1-3 (1 is really THE task, 2 and 3 are just simple extensions), or you can do the harder alternative task.
- Task 1: input = WALS code of one language, output = WALS code and
similarity scores for most similar languages.
- Task 2: input = genus (e.g. "Slavic"), output = centroid language of
that genus, i.e. a language most similar to other languages of the genus
- Task 3: find the weirdest language, i.e. most dissimilar to any other
language (for whole WALS, or for a given language genus/family)
- Alternative task: automatically generate missing values in WALS (e.g. if
all Slavic languages have the number of genders either 3 or unspecified,
you can probably set the unspecified values to 3). This is a harder task,
so if you do this one, you do not have to do the tasks 1-3.
- The definition of the task is somewhat vague; feel free to spend as
much or as little time on it as you wish
- Another existing approach: lang2vec
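- A minimal sketch of such a similarity script in Python (assumptions: language.tsv is tab-separated, the WALS code is in the first column, and feature values start after a fixed number of metadata columns -- adjust META_COLS and the filename to the actual file):
#!/usr/bin/env python3
# Hedged sketch: Hamming similarity between languages over shared WALS features.
import csv
import sys

WALS_TSV = "language.tsv"   # assumed filename, see the link above
META_COLS = 10              # assumption: the first columns are metadata (code, name, genus, family, ...)

def load_wals(path):
    langs = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                          # skip the header line
        for row in reader:
            langs[row[0]] = row[META_COLS:]   # WALS code -> feature values ("" = undefined)
    return langs

def hamming_similarity(a, b):
    # compare only the features defined for both languages;
    # one reasonable choice is to divide by the number of compared features
    same = compared = 0
    for x, y in zip(a, b):
        if x and y:
            compared += 1
            same += (x == y)
    return same / compared if compared else 0.0

if __name__ == "__main__":
    target = sys.argv[1]                      # e.g. a WALS code such as "cze"
    langs = load_wals(WALS_TSV)
    scores = sorted(((hamming_similarity(langs[target], feats), code)
                     for code, feats in langs.items() if code != target), reverse=True)
    for score, code in scores[:20]:           # Task 1: the most similar languages
        print(f"{code}\t{score:.3f}")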
tokenization
Deadline: Mar 18
3 points
pos_harmonization
Deadline: Mar 25
3 points
- Tagset harmonization exercise: You get a syntactic parser trained on
the UD tagset (UPOS and Universal Features), and
data tagged with a different tagset. Try to convert the tagset into the UD
tagset to get better results when applying the parser to the data.
- The data in the CoNLLU format and the trained UDPipe models
can be found at
http://ufallab.ms.mff.cuni.cz/~rosa/npfl120/pos_harm/.
- Running the parser
- To run the parser and get results in the CoNLLU
format, use e.g.:
cat ta-ud-dev-orig.conllu | ./udpipe --parse ta.sup.parser.udpipe
- To view the tree structures in the CoNLLU data, you
can use e.g. view_conll
or Udapi.
- To evaluate the parsing accuracy, use e.g.:
cat ta-ud-dev-orig.conllu | ./udpipe --parse --accuracy ta.sup.parser.udpipe
- This may be a bit confusing, so to clarify:
- For POS harmonization, we only assume to have the first 6 fields of
the CoNLLU available as input.
- We do not do intrinsic evaluation here by evaluating directly the
correctness of the harmonization.
- Rather, we do extrinsic evaluation, evaluating the success of the
harmonization indirectly by using the harmonized tags as input to a
syntactic parser and observing the parsing accuracy.
- To be able to do that, the data also contains the other two fields (HEAD
and DEPREL), i.e. the syntactic annotation against which we
evaluate the parser.
- Do not use HEAD and DEPREL in your code for the harmonization,
assume that these are not available! I.e. these two columns are sort
of the test data for you.
- The tagset documentation (in practice it is often quite hard
to get proper documentation for a tagset, but we decided to be
nice to you):
- Try to achieve some reasonable parsing accuracy – I guess at
least 50% should be achievable rather easily.
- Note that 100% accuracy is not reachable; the UAS upper bounds (measured
on UD test data) are:
CS 90%,
DE 85%,
EN 88%,
LA 68%,
TA 78%
- Your task is to try to do the harmonization yourself, not
using any pre-existing tools for that.
- Homework:
- Harmonize the tagset for one of the languages.
- You can use the template
harmonize.py (a minimal sketch of such a tag-mapping script appears after this list)
- Turn in the code that you used.
- Report the parsing accuracy before and after your
harmonization (both UAS and LAS); please measure the
accuracy repeatedly during the development and report
which changes to your solution brought which improvements
of the parsing accuracy.
- The minimum is to identify some of the main POS
categories, such as verbs, nouns, adjectives, and adverbs,
so that you get a reasonable parsing accuracy.
For doing that, you can get 2 points for the homework.
You can get more points if you further improve your
solution; some suggestions are listed below.
- You can try to
identify more POS categories; ideally you should map all
of the original POS tags to some UPOS tags.
- (You can try to produce some of the Universal
Features (documentation)
– but this will most probably not work well, as UDPipe uses
the features as one atomic string.)
- You can try to cover all of the languages, at least in a
basic way.
- You can figure out how to use Interset (see
the lecture), use it to harmonize the tagset, and
compare the parsing accuracy achieved when using your
solution and when using Interset (but you still need to create at least a simple
solution of your own).
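- A minimal sketch of what such a tag-mapping script could look like (not the provided harmonize.py; the prefix rules below are made-up placeholders -- replace them with rules derived from the tagset documentation, and check in which column your data stores the original tag):
#!/usr/bin/env python3
# Hedged sketch: map original language-specific tags to UPOS in CoNLL-U data.
# Assumption: the original tag is stored in the XPOS column (index 4); adjust if needed.
import sys

# hypothetical prefix -> UPOS rules; fill in real rules for your tagset
PREFIX_TO_UPOS = {"N": "NOUN", "V": "VERB", "A": "ADJ", "D": "ADV"}

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        print(line)
        continue
    cols = line.split("\t")
    if not cols[0].isdigit():                            # leave multiword-token and empty-node lines alone
        print(line)
        continue
    orig_tag = cols[4]                                   # assumed: original tag in the XPOS column
    cols[3] = PREFIX_TO_UPOS.get(orig_tag[:1], "X")      # harmonized tag goes into the UPOS column
    print("\t".join(cols))
- You could then pipe the harmonized data into the parser as shown above, e.g. cat ta-ud-dev-orig.conllu | python3 harmonize_sketch.py | ./udpipe --parse --accuracy ta.sup.parser.udpipe (harmonize_sketch.py is a hypothetical file name).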
mt
3 points
lab instructions: http://ufal.mff.cuni.cz/~zeman/langtech/npfl120/multiling-04-lab.txt
- Visually compare the left, right and intersection alignments; check in how many sentences you see the 'garbage alignments' that all fall onto one word.
- Compare the intersection alignment for the baseline and improved alignments.
- Write a small script that reads:
- source tokens
- target tokens
- alignment
and emits all pairs of aligned words. If run through sort | uniq -c | sort -n, this would be a translation dictionary. (A minimal sketch appears after this list.)
- Within the homework assignment, it is sufficient to get up to here; the further steps are optional.
- Continue the moses tutorial to train a phrase-based model (apply mert-moses.pl).
- Apply the trained model.
- Compare the translations from the default run and from the run with these model flags:
-dl=0 -max-phrase-length 1
- Some hints
- Jacob: I found a solution for the compilation issue. In case any other students are having trouble with modern compilers, this is the solution that worked for me: set the standard to C++03. I did this via an environment variable so I wouldn't have to edit the build files.
export CFLAGS_GLOBAL="-std=c++03"
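- A minimal sketch of the aligned-word-pair script mentioned above (assumptions: one sentence per line with space-separated tokens, and alignments in the usual 0-based "srcidx-tgtidx" format; the file names are placeholders):
#!/usr/bin/env python3
# Hedged sketch: emit aligned word pairs from tokenized parallel text plus an alignment file.
import sys

src_file, tgt_file, ali_file = sys.argv[1:4]

with open(src_file, encoding="utf-8") as s, \
     open(tgt_file, encoding="utf-8") as t, \
     open(ali_file, encoding="utf-8") as a:
    for src_line, tgt_line, ali_line in zip(s, t, a):
        src = src_line.split()
        tgt = tgt_line.split()
        for pair in ali_line.split():
            i, j = map(int, pair.split("-"))   # assumed: source index first, 0-based
            print(f"{src[i]}\t{tgt[j]}")
- You can then pipe its output through sort | uniq -c | sort -n to get the translation dictionary.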
The homework assignment is voluntary: you can submit it for extra points, but
you do not have to.
pos_tagging
Deadline: Apr 8
3 points
- devise a cross-lingual POS tagger for one under-resourced target language
- start here, finish as homework
- report what you did and your POS tagging accuracy on the UD test data
- suggested target language: Kazakh (kk) / Telugu (te)
- there are some training data in UD, but
let's pretend there are none and just use the test data
- there are some reasonable parallel data
- there is at least one reasonable high-resource source language for each of these to project the POS tags from -- choose the source language(s) yourself
- POS projection over (multi)parallel data
- take parallel data -- I suggest Watchtower and/or OpenSubtitles from OPUS:
- Watchtower (do not share) by Agić+ (2016)
- OPUS by Tiedemann and Nygaard (2004)
- WTC data is in a multiparallel format
- the same line in all the files corresponds to the same sentence in the various languages
- but some lines may be empty, as not all sentences are present in all the files
- some OPUS data are multiparallel, but I don't know how to easily get the multiparallel sentence alignment
- so if you use multiple sources at once, I suggest you use WTC
- POS tag the source side of the parallel data
- you can use the trained UDPipe models
- tokenize and tag with UDPipe:
udpipe --tokenize --tag path/to/model < input.txt > output.conllu
- or tag an already tokenized text:
udpipe --tag --input=horizontal path/to/model < input.txt > output.conllu
- or to only convert tokenized text to CONLLU format:
udpipe --input=horizontal path/to/model < input.txt > output.conllu
- word-align source and target
- you can use Giza++ or efmaral or FastAlign (see below)
- important: you need to have cmake for the installation of FastAlign, so if you don't have it, get it at https://cmake.org/download/ (or see the FastAlign website for instructions)
- FastAlign installation:
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
cmake ..
make
- I suggest using intersection alignment symmetrization, but you can play with this a bit
- FastAlign usage (add -s to also output alignment scores):
paste cs sk | sed 's/\t/ ||| /' | grep '. ||| .' > cs-sk
fast_align -d -o -v -i cs-sk > cs-sk.f
fast_align -d -o -v -r -i cs-sk > cs-sk.r
atools -i cs-sk.f -j cs-sk.r -c intersect > cs-sk.i
- project POS tags through the alignment from the tagged source to the non-tagged target (a minimal sketch of such a projection appears after this list)
- you can use the template pos_project.py (but it was created for a slightly different purpose, so you may need to change it a bit or a lot)
- take inspiration from the lecture to do the projection
- simply copying the POS tag from source to target with no other
tricks is sufficient to get 2 points for the assignment
- you still need to do something with unaligned words or
multiply aligned words (e.g. voting or weighted voting, or
simply use the knowledge that NOUN is usually the most
frequent POS...)
- doing something more clever carries more points
- ideally start with the simple solution, measure the base accuracy,
then implement some improvements, and repeatedly measure the
increase in accuracy (if any)
- train a tagger on the target data:
udpipe --train --tokenizer=none --parser=none --tagger='use_xpos=0;use_features=0' output.model < input.conllu
- evaluate the tagger on target test data:
udpipe --tag --accuracy path/to/model < test.conllu
- other notes (not important for this HW)
- you can use the HunAlign sentence aligner if you use parallel data that are not
sentence-aligned: install_hunalign.sh, hun_align.sh
- some data in Opus are weird; OpenSubtitles and Tanzil are nice
- once you have word-aligned data, you can also extract a simple
word-to-word translation dictionary (this single-best translation is weaker
than e.g. Moses as it does not take the context into account)
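- A minimal sketch of the simple projection (independent of the pos_project.py template; assumptions: the source side is tagged CoNLL-U, the target side is tokenized plain text, the alignment is 0-based "srcidx-tgtidx" pairs with the source index first, and the three files are sentence-parallel; unaligned target words fall back to NOUN):
#!/usr/bin/env python3
# Hedged sketch: project UPOS tags from a tagged source (CoNLL-U) to a tokenized
# target text through word alignment, writing the target side as CoNLL-U.
import sys

src_conllu, tgt_text, ali_file = sys.argv[1:4]

def read_sentences(path):
    # yield one sentence at a time as a list of (FORM, UPOS) pairs
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():          # skip multiword tokens (1-2) and empty nodes (8.1)
                    sent.append((cols[1], cols[3]))
    if sent:
        yield sent

with open(tgt_text, encoding="utf-8") as t, open(ali_file, encoding="utf-8") as a:
    for src_sent, tgt_line, ali_line in zip(read_sentences(src_conllu), t, a):
        tgt_tokens = tgt_line.split()
        tags = ["NOUN"] * len(tgt_tokens)      # fallback for unaligned target tokens
        for pair in ali_line.split():
            i, j = map(int, pair.split("-"))   # assumed: source index first, 0-based
            if i < len(src_sent) and j < len(tgt_tokens):
                tags[j] = src_sent[i][1]       # copy the UPOS of the aligned source word
        for idx, (form, tag) in enumerate(zip(tgt_tokens, tags), 1):
            # HEAD/DEPREL are left unfilled; tagger training ignores the syntax
            print("\t".join([str(idx), form, "_", tag, "_", "_", "_", "_", "_", "_"]))
        print()
- You can then train and evaluate the UDPipe tagger on the resulting file as shown above.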
delex_parsing
Deadline: Apr 15
3 points
- applying lexicalized versus delexicalized parsers in a monolingual and cross-lingual setting
- trained lexicalized ("sup") and delexicalized ("delex") UDPipe 1.2 models trained on UD 2.1 treebanks
- language groups for experimenting:
- Norwegian (no), Danish (da), Swedish (sv)
- Czech (cs), Slovak (sk)
- Spanish (es), Portuguese (pt)
- UD treebanks
- evaluating a trained UDPipe parser on test treebank data (only parsing, no tagging!):
udpipe --parse --accuracy path/to/model < test.conllu
- training a delexicalized UDPipe parser (without morpho features); the last parameter (cs.delex.parser.udpipe) is an output parameter, i.e. udpipe will create this file and store the model in it:
cat cs-ud-train.conllu | ./udpipe --train --parser='embedding_form=0;embedding_feats=0;' --tokenizer=none --tagger=none cs.delex.parser.udpipe
- Homework:
- Extend your cross-lingual POS tagging homework to cross-lingual parsing
- Train a delexicalized parser on a source language treebank, and apply it to your cross-lingually-POS-tagged target-language data
- Report the parsing accuracies you obtain (LAS and UAS)
- You may also try the source-lexicalized parsing:
- Train a standard lexicalized parser (but still without morpho features) on the source language
- Apply it to the target language (without any translation)
- This will only work well if there is a substantial amount of shared
vocabulary between the source and the target language,
i.e. they are lexically very close
- Other notes -- combining multiple parsers via the MST algorithm (you do not
have to do this in this HW):
- parse a sentence with multiple parsers -- you get multiple parse trees,
i.e. 3 sets of dependency edges if you used 3 parsers
- assign weights to the edges (e.g. 1
if the edge appeared in one parser output, 2 if in 2, etc.; or
incorporate language similarity into the weights as well, i.e. edges
from less similar languages get a lower weight)
- give the list of edges and their weights to an MST
algorithm, which outputs the best tree that can be constructed
from the edges
- you can use my Perl wrapper of the Perl
Graph::ChuLiuEdmonds
library (or look at my code and use the library directly from your
Perl code; unfortunately I am unaware of any good implementation
of a directed MST algorithm in Python)
- my wrapper takes standard input (one sentence per line) and writes to standard output (one sentence per line)
- the input format is
number_of_nodes parent child weight parent child weight...
where parent and child are 1-based integer IDs of the parent and child nodes of the edge and the weight is a weight you assign to the edge, so e.g.:
3 0 2 1.5 2 1 0.5 2 3 0.5 3 1 1.2
(a sketch of producing this format by edge voting appears after this list)
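- A minimal sketch of building the wrapper's input line from several parses of one sentence by simple edge voting (assumption: each parse is given as a list of head IDs, where heads[k] is the head, 0 = root, of the (k+1)-th word):
#!/usr/bin/env python3
# Hedged sketch: combine parses from multiple parsers into one weighted edge list.
from collections import Counter

def mst_input_line(parses):
    n = len(parses[0])                     # number of nodes (words) in the sentence
    votes = Counter()
    for heads in parses:
        for child, head in enumerate(heads, start=1):
            votes[(head, child)] += 1      # one vote per parser proposing this edge
    parts = [str(n)]
    for (head, child), weight in votes.items():
        parts += [str(head), str(child), str(weight)]
    return " ".join(parts)

# toy example with three hypothetical parses of a 3-word sentence
print(mst_input_line([[2, 0, 2], [2, 0, 2], [0, 1, 2]]))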
tree_projection
Deadline: Apr 22
3 points
- Projecting trees over parallel data:
- all
data is here:
PUD = parallel treebanks, align = alignments by FastAlign
- Beware: CONLL-U token IDs are 1-based, FastAlign token IDs are
0-based
- Beware: tokens with non-integer ID (like 5-6 or 8.1) are not part of
the tree nor of the alignment (so maybe you can just grep them
away)
- Beware: forms and lemmas can contain spaces in CONLL-U
- You can use the template project.py which I prepared (it does
the reading in and writing out)
- because this is a parallel treebank, you have gold standard
annotation for both the source tree and the target tree, so you
can measure the accuracy of your projection (in real life, you have
parallel data which do not have any annotation, so you need to parse
the source data with a parser, and then train a parser on the target
data)
- you can use e.g. my evaluator.py for that
- use it e.g. as
python3 evaluator.py -j -m head gold.conllu pred.conllu
- run it as python3 evaluator.py -h for more info; most importantly, you can also use -m deprel or -m las
- Homework:
- implement the projections somehow (a minimal sketch appears after this list)
- try to ensure that what you produce is a rooted tree (only one
root, all nodes have a head assigned, no cycles); report how you did
this and if you succeeded
- also predict deprels somehow (projecting is good, looking at POS
can also be good)
- evaluate your solution automatically for several language pairs and report the scores
- ideally, also compare the accuracies to the delex parsing approach
- report what you found out
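- A minimal sketch of direct edge projection (assumptions: the source tree is a list of 1-based head IDs with 0 for the root, and the alignment is a dict from 0-based source indices to 0-based target indices, matching the 1-based CoNLL-U vs. 0-based FastAlign indexing noted above; making the result a properly rooted tree, as the homework asks, is left to you):
#!/usr/bin/env python3
# Hedged sketch: project dependency edges from a source tree to the target side
# through word alignment (direct projection, no further heuristics).

def project_tree(src_heads, alignment, tgt_len):
    """Return a list of target heads (0 = root); unreached tokens default to 0."""
    tgt_heads = [0] * tgt_len                      # fallback: attach to the root
    for src_dep0, src_head1 in enumerate(src_heads):
        if src_dep0 not in alignment:
            continue                               # unaligned source dependent
        tgt_dep0 = alignment[src_dep0]
        if src_head1 == 0:
            tgt_heads[tgt_dep0] = 0                # the root stays the root
        elif (src_head1 - 1) in alignment:
            tgt_heads[tgt_dep0] = alignment[src_head1 - 1] + 1   # back to a 1-based ID
    return tgt_heads

# toy example: 3-word source tree, word 2 is the root, 1:1 hypothetical alignment
src_heads = [2, 0, 2]
alignment = {0: 0, 1: 1, 2: 2}
print(project_tree(src_heads, alignment, 3))       # -> [2, 0, 2]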
tree_translation
Deadline: Apr 29
3 points
- Lab: cross-lingual parsing lexicalized by translation of the training
treebank using machine translation
- we get back to the VarDial 2017 cross-lingual parsing shared task setup: 3 language
pairs (one is actually a triplet), using supervised POS tags:
- Czech (cs) source, Slovak (sk) target
- Slovene (sl) source, Croatian (hr) target
- Danish (da) and/or Swedish (sv) source, Norwegian (no) target
- choose any language pair you want, or use other languages if you prefer
- for the language pairs above, some datasets
are prepared for the lab (but that's only a minor convenience, you can
simply use UD treebanks and e.g. OpenSubtitles or WatchTower parallel data
for any languages)
- "treebanks" are the training treebanks for the source
languages and evaluation treebanks for the target
languages
- "smaller_delex_models" are the baselines, i.e.
delexicalized UDPipe parsers trained on the first 4096
sentences from the training treebanks; apply them to the
target evaluation treebanks to measure the baseline
accuracy (around 55 LAS I think)
- "our_vardial_models" are lexicalized parsing models
which we submitted into the competition, about +5 LAS
above the baselines (can you beat us?! :-))
- "para" are parallel data, obtained from
OpenSubtitles2016 aligned by MonolingualGreedyAligner with
intersection symmetrization (the format of the data is
"sourceword[tab]targetword" on each line); there are also
"tag" variants where POS tag and morphological features
are annotated for the source word
- "translate_treebank.py" is a simple implementation of
treebank translation which you can use for your
inspiration
- the baseline approach is to translate each word form in the
source treebank (second column) by its most frequent target
counterpart from the parallel data (as done by the sample
"translate_treebank.py" script; a minimal sketch of this baseline appears after this list),
and then train a standard UDPipe parser on that:
udpipe --train --tokenizer=none --tagger=none out.model < train.conllu
and evaluate the parser on the target evaluation treebank:
udpipe --parse --accuracy out.model < dev.conllu
- there are many possible improvements to the approach:
- use better word alignment (e.g. FastAlign intersection
alignment)
- use the source POS tags and/or morphological features
for source-side disambiguation -- e.g. the word "stát" in
Czech should be translated differently as a noun ("state")
and as a verb ("stand"); you already have this annotation
in the source treebank, and you can get it in the parallel
data using a UDPipe tagger trained on the source treebank
(which is how we produced the "tag" variants of the para
data, which you can use)
- use multiple source languages -- either combine the
parsers using the MST algorithm, or simply concatenate the
source treebanks into one (that's what we did in VarDial
for Danish and Swedish -- if you see the "ds" language
code, this means just that)
- use a proper MT system (word-based Moses probably?)
- use your knowledge of the target language for some
additional processing
- guess some translations for unknown words
- pre-train target language word embeddings with
word2vec (on some target language plaintext -- you can
also use the target side of the parallel data) and provide
the pre-trained embeddings to UDPipe in training; see the
UDPipe manual;
another good option is to download pre-trained FastText word embeddings from
fasttext.cc
(use the text format, this is what UDPipe can read in)
- etc., you can have your own ideas for improvements
- Homework:
- implement cross-lingual parsing lexicalized by treebank
translation (it is sufficient to use one language pair, either one
of the above or your own)
- describe what you did and report achieved LAS scores evaluated
on the target language treebank
- doing the simplest baseline lexicalization approach described
above carries 2 points
- implementing some of the improvements carries more points
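- A minimal sketch of the baseline word-for-word treebank translation described above (not the provided translate_treebank.py; assumptions: the parallel data has one "sourceword<TAB>targetword" pair per line, and words unseen in the dictionary are kept untranslated):
#!/usr/bin/env python3
# Hedged sketch: replace each source word form in a CoNLL-U treebank by its most
# frequent target counterpart seen in the parallel word-pair data.
import sys
from collections import Counter, defaultdict

para_file, treebank_file = sys.argv[1:3]

# count target translations for each source word
counts = defaultdict(Counter)
with open(para_file, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            counts[parts[0]][parts[1]] += 1

# keep only the most frequent translation per source word
best = {src: tgt_counts.most_common(1)[0][0] for src, tgt_counts in counts.items()}

with open(treebank_file, encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line and not line.startswith("#") and len(cols) >= 2:
            cols[1] = best.get(cols[1], cols[1])   # translate the FORM column
        print("\t".join(cols))
- The translated treebank can then be fed to udpipe --train as shown above.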
embeddings
Deadline: May 6
3 points
- some further resources on cross-lingual mapping of embedding spaces:
- The Facebook research group (Lample et al):
- The Basque research group (Artetxe et al; Artetxe himself now at Facebook AI):
- A paper on isomorphism assumption (spoiler: typological differences do not seem to lead to non-isomorphism):
- monolingual word embeddings: https://fasttext.cc/
- download and install fastText (it is sufficient to make it)
- download and gunzip a model
- Models for 157 languages
- download both the bin format and the text format (bin is for the fasttext tools, text for any other usage)
- run fasttext to see all the available options
fasttext nn cc.en.300.bin
- input e.g. "dog" to see words most similar to a dog
fasttext analogies cc.en.300.bin
- input e.g. "teacher school hospital" (which means "teacher - school + hospital") to see what happens when you replace the "schoolness" of a teacher by "hospitalness"
- embeddings visualisation: https://projector.tensorflow.org/
- optional:
- bilingual embeddings: https://github.com/artetxem/vecmap
- homework assignment: cross-lingual POS tagging or parsing with bilingual word embeddings
- choose a source language and a target language
- get word embeddings for both the languages
- download them in text format from FastText website
- (or train them yourself using e.g. Wikipedia texts if you want to)
- perform a cross-lingual mapping of the embeddings with VecMap (or another
tool if you want)
- if the embeddings files are large, you can just take e.g. the first
100,000 lines from each embeddings file (these will be the 100,000 most
frequent words)
- you can improve the results by using a bilingual dictionary and the supervised or semi-supervised setting
- you can extract a bilingual dictionary from parallel data --
e.g. take intersection alignment and construct a
dictionary from all aligned pairs of words
- or you can download dictionaries from OPUS: https://opus.nlpl.eu/
(search for a language pair, and look for a "dic" link in the "dic"
column)
- or you can use the identical or unsupervised setting
- note that the supervised variant runs much faster (~2 minutes) than the other options (~5 hours unless you have a GPU)
- note that VecMap requires single word to single word dictionaries, so if
you have phrases (multiple words corresponding to something) you have to
get rid of these somehow
- create one bilingual embeddings file
- from VecMap, you will get two new embedding files, one for the source language and
one for the target language, which contain similar vectors for
similar words across languages
- we need to have one mapping of words in both languages to a common
space, to give to UDPipe to use as the representation of words
- (or we would need to train UDPipe with the crosslingual embeddings for
the source language and then exchange its embedding vocabulary for the
target language crosslingual embeddings, which may or may not be
possible)
- so we can just concatenate the two files to get bilingual embeddings (ideally, we would be a little
clever with the header line and duplicates; see the sketch after this list)
- if you want to do parsing
- if you want to do POS tagging
- train a simple MLP classifier to predict UPOS from embeddings
- note that these embeddings are static so they cannot handle the fact
that some words might pertain to different POS based on context
- you could do some sequence labelling to take in also the context, but this is not at all required for this assignment
- look at the next class for instructions on training an MLP classifier
- evaluate the trained model
- evaluate the model on a test treebank for both the source and the target
language
- compare with some meaningful alternatives (e.g. for parsing: delex parser, projected
parser, supervised parser...)
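- A minimal sketch of concatenating the two mapped embedding files into one bilingual file (assumptions: both files are in the fastText text format with a "word_count dimension" header line and the same dimension; for duplicate words, the vector from the first file wins):
#!/usr/bin/env python3
# Hedged sketch: merge two fastText text-format embedding files into one bilingual file.
import sys

src_emb, tgt_emb, out_path = sys.argv[1:4]

seen = set()
lines = []
dim = None
for path in (src_emb, tgt_emb):
    with open(path, encoding="utf-8") as f:
        header = next(f).split()
        dim = dim or header[1]                    # keep the dimension from the first file
        for line in f:
            word = line.split(" ", 1)[0]
            if word not in seen:                  # skip duplicates across the two files
                seen.add(word)
                lines.append(line)

with open(out_path, "w", encoding="utf-8") as out:
    out.write(f"{len(lines)} {dim}\n")            # rewrite the header with the new word count
    out.writelines(lines)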
bert
Deadline: May 13
3 points
- BERT and mBERT by Google: https://github.com/google-research/bert
- The HuggingFace community makes neural NLP easy to use: https://huggingface.co/
- I will work with BERT and mBERT
- but feel free to use other models, e.g. DistilBERT (and multilingual DistilmBERT), which are smaller, faster, and lighter
- Install HuggingFace Transformers (there is a detailed guide on their website; you need to first install TensorFlow or PyTorch)
# virtual environment
python3 -m venv venv
source venv/bin/activate
# install transformers and torch
pip install transformers torch
- Get contextual word embeddings!
# Imports
from transformers import BertModel, BertTokenizer
import torch
# Loads the model (downloads it if not yet downloaded)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Some valid options:
# bert-base-uncased
# bert-base-cased
# bert-large-cased
# bert-base-multilingual-uncased
# bert-base-multilingual-cased
# Input
sentence = "A platypus is a mammal."
# Tokenize and convert to token ids as pytorch tensor
ids = tokenizer.encode(sentence, return_tensors='pt')
# Let's see the tokens
# Note the special initial and final token
# [0] because technically this is a torch batch of 1 sentence
bert_tokens = tokenizer.convert_ids_to_tokens(ids[0])
# Run the BERT model
output = model(ids)
# See the contextual embedding of the first "A" word
# output[0] is the last encoder layer output
# output[0][0] for the first sentence
# output[0][0][0] is the [CLS] token
# output[0][0][1] is the "A" token
emb_a = output[0][0][1]
emb_mammal = output[0][0][-3]
# Measure cosine similarity of contextual embeddings
from scipy import spatial
def cos_sim(emb1, emb2):
    return 1 - spatial.distance.cosine(emb1.detach(), emb2.detach())
# measure cosine similarity of instances of "a"
cos_sim(output[0][0][1], output[0][0][-4])
# Tip: measure cosine similarity of "dog" and "pes" in mBERT?
# Tip: compute sentence representation as sum of tokens?
- Let's train a simple BERT-based tagger!
- Take some treebanks, e.g. PUD; we need English, and e.g. Czech
- Get BERT contextual embeddings for tokens in the treebank using connlu2vectors.py
# Load BERT, read CoNLL-U, for each token write UPOS and contextual embedding
# Does something too stupid to join wordpieces into tokens
# Skips sentences where this fails
./connlu2vectors.py bert-base-uncased < en_pud-ud-test.conllu > en_pud.bert
- Train a simple MLP classifier to predict UPOS from contextual embeddings using train_mlp.py
# Read data, split into train and test,
# train MLP classifier, report accuracies, save model
./train_mlp.py en_pud.bert.model < en_pud.bert
- Apply the classifier to English data, as well as to e.g. Czech data, using eval_mlp.py
# Evaluating on training data, basically...
./eval_mlp.py en_pud.bert.model < en_pud.bert
# Using monolingual (English) BERT on Czech: not good
./eval_mlp.py en_pud.bert.model < cs_pud.bert
- Do the same but using multilingual mBERT instead of monolingual BERT
./connlu2vectors.py bert-base-multilingual-uncased < en_pud-ud-test.conllu > en_pud.mbert
./connlu2vectors.py bert-base-multilingual-uncased < cs_pud-ud-test.conllu > cs_pud.mbert
./train_mlp.py en_pud.mbert.model < en_pud.mbert
./eval_mlp.py en_pud.mbert.model < cs_pud.mbert
- Now the tagger trained on English magically works also for Czech!
- If there is time: briefly try playing with ChatGPT for multilingual and crosslingual tasks
- Good if you can formulate your task as a text-based task in plain
language (e.g. "Is this sentence positive or negative or neutral?") --
LLMs are trained for next word generation and typically also tuned for
textual instructions
- SotA for machine translation for some language pairs (especially when
translating into English)
- Worth trying if you need to use some specialized terminology (e.g.
"Which token is the subject of this sentence: 'On Monday John is walking the dog.'"), but typically better results if
you can formulate your task in plain language (e.g. "Who is doing the
activity, Monday or John?")
- Nearly useless for complex tasks involving specialized terminology and
special formats (e.g. "Give me the CONLL-U analysis of this sentence
according to UD 2.0") -- better to split that up into simpler tasks
using more plain-language text-based formulations
- Homework assignment
- try to do cross-lingual POS tagging with mBERT (a minimal sketch of the training/evaluation step appears after this list)
- compare several setups
- you can choose a fixed target language and vary the source languages
- you can also combine (concatenate) multiple source languages
- you can try various target languages
- you can try to improve the tokenization mismatch problem
- you can compare mBERT to vecmap
- you can play with the classifier setup
- (if you have access to a GPU, you can also try fine-tuning the mBERT
model; but this is very computationally demanding and can take a lot of
time, so probably you should not attempt this for the homework assignment)
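- A minimal sketch in the spirit of the train_mlp.py / eval_mlp.py scripts above (the provided scripts may differ): train an MLP on (m)BERT token embeddings for a source language and evaluate it on a target language; the assumed data format is one token per line, the UPOS tag first, followed by the embedding values, all whitespace-separated (adapt the loader to the actual connlu2vectors.py output):
#!/usr/bin/env python3
# Hedged sketch: MLP classifier mapping contextual token embeddings to UPOS tags.
import sys
import numpy as np
from sklearn.neural_network import MLPClassifier

def load(path):
    labels, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) > 1:
                labels.append(parts[0])
                vectors.append([float(x) for x in parts[1:]])
    return np.array(vectors), labels

train_path, test_path = sys.argv[1:3]      # e.g. en_pud.mbert cs_pud.mbert

X_train, y_train = load(train_path)
X_test, y_test = load(test_path)

# a small MLP is enough to map contextual embeddings to UPOS
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=50)
clf.fit(X_train, y_train)

print("source-language accuracy:", clf.score(X_train, y_train))
print("target-language accuracy:", clf.score(X_test, y_test))
- Varying which source-language file you train on and which target-language file you evaluate on gives the comparisons asked for above.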
Conclusion
- So can we forget everything we learned in previous classes and just use mBERT?
- Probably in most multilingual and cross-lingual situations, mBERT should
be the tool to use.
- mBERT only covers 104 languages; for languages that are not covered, it
might work somewhat or not at all
- Generative multilingual LLMs (GPT-4, Mixtral, etc) are worth looking at
- Again, only some languages are covered, and smaller languages are
covered worse than large languages (according to data size, not speaker
numbers!)
- Good if you can formulate your task as text generation, much worse if
you need to work with some structured data
- Not very good for getting contextual embeddings (mBERT typically better)
- Many of the problems are still there (alphabets, tokenization, language
similarity...)
- Some approaches are somewhat outdated for the task for which we showed them (e.g.
delexicalized parsing) but are useful concepts often used elsewhere (e.g.
delexicalization in dialogue systems)
- In the course, you should have learned both general transferable stuff and
specific practical stuff
- Individual tools and methods change fast; you should know and be able to
use some of the current tools and methods, but you need to keep learning
- General ideas and approaches transfer; you should be able to apply your
understanding of the problem area even with new tools and methods
- Language properties stay; we may have new tools to solve the old problems,
but the problems themselves will not go away
enhancing_ud
Deadline: May 15
3 extra points
This is a voluntary assignment, you can do it to gain extra points but you do not have to.
- We are spending some time with syntax annotation harmonization (as we
have only covered morphological annotation harmonization so far), and with
Enhanced UD.
- Lab: enhancing Czech UD treebank with information from the
tectogrammatical (deep syntactic) annotation in PDT
- Voluntary homework: try to enhance the Czech UD with
something
- Choose any of the phenomena listed in Enhanced UD, and try to
enrich the Czech UD annotation with it
- You may add abstract nodes (with non-integer IDs such as 7.1)
and/or add secondary dependencies (these go into the DEPS (9th)
column, see CoNLL-U
format)
- You may use the tecto annotation; there are the same sentences
in the tecto file as in the ud file, and the enhance.py script takes care
of loading the corresponding pairs of sentences; the most useful columns
are probably the ID, the COREF_IDS (IDs of coreference antecedents,
separated by pipes "|"), and the EFFHEADS (effective heads, such as the
"real" head for all conjuncts, or even multiple heads e.g. for shared
modifiers)
- For some phenomena the tecto annotation is probably not needed
- You can also try to work with a different language if you
decide to focus on something where you don't need the tecto
annotation
- It is not always clear how to do the enhanced UD, and also the
tecto annotation is often quite complex, so don't worry if you get
lost or confused -- just try to do something, and then submit the
code and a commentary on what you tried to do, how you did it, and how much
you think you succeeded...