SIS code:

Semester:

summer

E-credits:

Examination:

C + Ex

Instructor:

Zdeněk Žabokrtský

NPFL124 – Natural Language Processing

About

NPFL124 provides students with knowledge and hands-on experience related to basic (mostly statistical) methods in the field of Natural Language Processing. The students will be acquainted with fundamental components such as corpora and language models, as well as with complex end-user applications such as Machine Translation.

The course consists of six two-week thematic blocks taught by six lecturers:

Jindřich Helcl: Statistical fundamentals of NLP
Daniel Zeman: Morphological and syntactic analysis
Pavel Pecina: Information retrieval
Jindřich Libovický: Deep learning in NLP
Ondřej Bojar: Machine translation
Zdeněk Žabokrtský: NLP data resources and evaluation

Scheduled time and location

lectures on Thursdays at 10:40 in S3, every week starting from the first week of the summer semester (note that two Thursdays in May are cancelled because of the state holidays)
practicals on Wednesdays at 9:00, in SU2 in odd weeks of the semester (starting from the 3rd week), in S4 in even weeks of the semester (starting from the second week)

Lectures

1. Introduction Intro to NLP Questions

2. Language modeling. Language Models Questions

3. Morphological analysis Morphology Questions

4. Syntactic analysis Syntax Questions

5. Information retrieval IR Assignment on IR

6. Information retrieval, cont. IR cont. Questions

7. Introduction to Deep Learning in NLP Deep learning intro Recording Assignment on NN interpretation

8. Deep learning applications in NLP DL in applications LLMs Recording Questions

9. Machine translation MT intro+Word Alignment+PBMT Word Alignment by Philipp Koehn Recording Lab: IBM1 Word Alignment

10. Machine translation, cont. Main Slides: Neural MT Extra Slides: Transformer Recording Questions

11. Overview of Language Data Resources Data resources Questions

12. Evaluation measures in NLP Evaluation Questions

13. Early exam

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

1. Introduction

February 20, 2025 Intro to NLP Questions

Lecturer: Jindřich Helcl

Topics:

Motivation for NLP.
Basic notions from probability and information theory.

2. Language modeling.

February 27, 2025 Language Models Questions

Lecturer: Jindřich Helcl

Topics:

Language models.
The noisy channel model.
Markov models.

3. Morphological analysis

March 6, 2025 Morphology Questions

Lecturer: Daniel Zeman

Topics:

Morphological tags, parts of speech, morphological categories.
Finite-state morphology.

(Slides covered down to no. 46. To be completed next week.)

Practicals:

Czech National Corpus
- KonText help
- CQL help, or also here
- SyD: historical timeline of two competing words, or comparison in current language
- WaG / Slovo v kostce
Lindat services
- Lindat KonText
- TEITOK UD 2.14
  - All UD treebanks can be searched in one query. In addition to CQL syntax, one can restrict the search to some language codes (text_langcode) or language names (text_lang): [word="her" & upos="(PRON|DET)" & feats=".Gender=Fem."] :: match.text_lang != "Naija" & match.text_project != "ParTUT"
- PML-TQ
- UDPipe
Morfo Czech inflection generator
Nástroje na ÚFALu (an old tutorial for students of linguistics in Olomouc)

4. Syntactic analysis

March 13, 2025 Syntax Questions

Lecturer: Daniel Zeman

Topics:

Dependency vs. phrase-based model.
Dependency parsing.

5. Information retrieval

March 20, 2025 IR Assignment on IR

Lecturer: Pavel Pecina

Topics:

Intro to IR.
Boolean model.
Inverted index.

6. Information retrieval, cont.

March 27, 2025 IR cont. Questions

Lecturer: Pavel Pecina

Topics:

Probabilistic models for Information Retrieval.

7. Introduction to Deep Learning in NLP

April 3, 2025 Deep learning intro Recording Assignment on NN interpretation

Lecturer: Jindřich Libovický

Topics:

Neural network basics
Word embeddings, sequence-processing architectures
Pre-trained models: Word2Vec, BERT

The exercise is available in a Google Colab Sheet.

8. Deep learning applications in NLP

April 10, 2025 DL in applications LLMs Recording Questions

Lecturer: Jindřich Libovický

Topics:

Named entity recognition
Answer span selection
Generative language models

9. Machine translation

April 17, 2025

MT intro+Word Alignment+PBMT Word Alignment by Philipp Koehn Recording

Lab: IBM1 Word Alignment

Lecturer: Ondřej Bojar

Topics:

Introduction to MT.
MT evaluation.
Alignment.
Phrase-Based MT.

Additional materials:

10. Machine translation, cont.

April 24, 2025 Main Slides: Neural MT Extra Slides: Transformer Recording Questions

Topics:

Fundamental problems of PBMT.
Neural machine translation (NMT).
- Brief summary of NNs.
- Sequence-to-sequence, with attention.
- Transformer, self-attention.
- Linguistic features in NMT.

11. Overview of Language Data Resources

May 14, 2025, exceptionally Wednesday!!! exceptionally lecture in S4!!! Data resources Questions

Lecturer: Zdeněk Žabokrtský

Lecture topics:

Types of language data resources.
Annotation principles.

Practicals:

Some already existing language data resources can be found using:
playground for playing the language data resource game

12. Evaluation measures in NLP

May 15, 2025 Evaluation Questions

Lecturer: Zdeněk Žabokrtský

Topics:

Purposes of evaluation.
Evaluation best practices, estimating upper and lower bounds.
Task-specific measures.

13. Early exam

Date: May 22, 2025

The first option for passing the final exam written test ("předtermín").
Additional exam dates can be offered in the exam period.

1. Language Identification

2. Information Retrieval

3. Analysis of a Trained Model for Sentiment Classification

1. Language Identification

Deadline: 4th April 2025, 23:59 Submission form

This assignment is an application of the topics covered in lectures 1 and 2. Your task is to gather text data from various online sources and in multiple languages and to train n-gram language models to identify the language the texts are in.

The submission will consist of a single IPython notebook (preferably a link to Google Colab), plus a filled-in checklist.

Include also any code used for data gathering. In case it is not trivial to replicate the data gathering phase, you might consider putting the resulting dataset on a publicly accessible URL (such as the public_html folder in your lab account) and calling !wget from the IPython notebook to retrieve it.

Proceed in the following steps:

Gather plain text data in multiple languages, save each in separate files, one file per language. The language choice is up to you - it should be at least two languages plus English. If you choose to work with languages that do not use the Latin script, you can replace English by a third language; in all cases please only work with languages that share the same script (language-specific characters like "ř" in Czech are fine).
Tokenize everything (you can use the Sacremoses library for this).
Report the size of the data in tokens and bytes. You should collect at least 200k tokens per language.
Split your data into training, heldout and test sets. (Use 80% of the data for training and 10% each for heldout and test.)
Estimate the unigram, bigram, and trigram probabilities of character n-grams in each language separately. (Use the conditional distributions for higher-level n-grams to be able to apply the Markovian property to estimate probabilities of sequences.)
Report 5 most common character trigrams per language, along with their counts and relative frequencies (count divided by the size of the data).
Estimate the "add less than one" smoothing parameter (described in slide 16 of Lecture 2) for the trigram language model. Remember to use the heldout set for this!
Report the values of the smoothing parameters (one per language).
Calculate the cross-entropies of all (trigram) language models on all test sets.
Write a function to identify language by comparing probabilities given by your models (the highest probability wins). This function should accept a string (of arbitrary length, containing more words or sentences) and return a list of pairs (probability, language) ordered by the probability, highest first.
Submit everything using the submission form.
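The core of the steps above can be sketched as follows. This is a minimal illustration under several assumptions, not a reference solution: the tiny training and heldout strings are made up, the vocabulary size is taken as the number of distinct characters in the training data, histories are padded with two spaces, and the smoothing parameter is picked by a simple grid search on the heldout text rather than by the procedure from the lecture slides. The function names are hypothetical. Scores are base-2 log-probabilities instead of raw probabilities to avoid numerical underflow on long inputs.

```python
import math
from collections import Counter


def train_counts(text):
    """Count character trigrams and their two-character histories."""
    padded = "  " + text  # two spaces of padding give the first characters a history
    n = len(padded) - 2
    trigrams = Counter(padded[i:i + 3] for i in range(n))
    histories = Counter(padded[i:i + 2] for i in range(n))
    return trigrams, histories


def trigram_prob(trigram, model, lam, vocab_size):
    """P(c3 | c1 c2) with additive smoothing, where 0 < lam <= 1."""
    trigrams, histories = model
    return (trigrams[trigram] + lam) / (histories[trigram[:2]] + lam * vocab_size)


def log2_prob(text, model, lam, vocab_size):
    """Log-probability of a string under the Markov (trigram) assumption."""
    padded = "  " + text
    return sum(math.log2(trigram_prob(padded[i:i + 3], model, lam, vocab_size))
               for i in range(len(padded) - 2))


def cross_entropy(text, model, lam, vocab_size):
    """Cross-entropy in bits per character."""
    return -log2_prob(text, model, lam, vocab_size) / len(text)


def choose_lambda(heldout, model, vocab_size, candidates=(0.01, 0.1, 0.5, 1.0)):
    """Grid search (an assumption): keep the candidate with the lowest heldout cross-entropy."""
    return min(candidates, key=lambda lam: cross_entropy(heldout, model, lam, vocab_size))


def identify(text, models, lambdas, vocab_size):
    """Return (log-probability, language) pairs, most probable language first."""
    scored = [(log2_prob(text, model, lambdas[lang], vocab_size), lang)
              for lang, model in models.items()]
    return sorted(scored, reverse=True)


# Toy data, far below the 200k tokens per language that the assignment asks for.
train = {
    "en": "the cat sat on the mat and the dog ate the food",
    "cs": "kočka seděla na rohožce a pes snědl jídlo",
}
heldout = {"en": "the dog sat on the mat", "cs": "pes seděl na rohožce"}

vocab_size = len(set("".join(train.values())))
models = {lang: train_counts(text) for lang, text in train.items()}
lambdas = {lang: choose_lambda(heldout[lang], models[lang], vocab_size) for lang in models}

print(identify("the dog and the cat", models, lambdas, vocab_size))
```

Note that characters unseen in training are not counted in vocab_size here, so on real data the vocabulary should be fixed before smoothing so that the distribution still sums to one.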

2. Information Retrieval

Deadline: 2nd May 2025, 23:59 Submission form

The goal of the assignment is to create and evaluate inverted indices for Czech and English document collections, implement Boolean query operators, process queries, and analyze the results to compare the performance of the system for both languages.

Step 1: Create Inverted Index

Create separate inverted indices for the provided document collections in Czech and English using a hash table for the dictionary. Each term in the dictionary should map to a list of document IDs (postings) where the term appears.

Parse each document in the collection to extract terms from the text fields.
Normalize the terms by converting them to lowercase, removing punctuation, and applying stemming or lemmatization (!).
For each term, add the document IDs to the postings lists in lexicographical order.
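A minimal sketch of Step 1, assuming the documents have already been parsed into a dictionary from document ID to text (the real collections need their own parser). The names normalize and build_index are hypothetical, and stemming or lemmatization, which the assignment requires, is deliberately left out to keep the example short.

```python
import string
from collections import defaultdict


def normalize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace.

    Stemming or lemmatization would be applied here as well.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    index = defaultdict(set)  # the dictionary is a hash table
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    # Postings are kept in lexicographical order of document IDs.
    return {term: sorted(ids) for term, ids in index.items()}


docs = {"d2": "The dog sleeps!", "d1": "The cat sleeps."}
index = build_index(docs)
print(index["sleeps"])
```

Collecting IDs in a set first avoids duplicate postings when a term occurs several times in one document.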

Step 2: Implement Boolean Query Operators

Implement Boolean query operators for two terms: x AND y, x OR y, and x AND NOT y by iterating over the sorted postings lists.
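One possible shape of the three operators, sketched as linear merges over two sorted postings lists. The function names are hypothetical and the postings are assumed to be sorted and free of duplicates.

```python
def intersect(p1, p2):
    """x AND y: IDs present in both sorted postings lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1, p2):
    """x OR y: IDs present in at least one of the lists."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    result.extend(p1[i:])  # at most one of the two tails is non-empty
    result.extend(p2[j:])
    return result


def and_not(p1, p2):
    """x AND NOT y: IDs present in p1 but absent from p2."""
    result, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return result


x, y = [1, 3, 5, 7], [3, 4, 7, 9]
print(intersect(x, y), union(x, y), and_not(x, y))
```

Each operator advances through both lists once, so the cost is linear in the total length of the two postings lists.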

Step 3: Process Boolean Queries

Process the Boolean queries for the provided topics separately for Czech and English and generate the results.

Parse the topic file to extract queries.
Tokenize and normalize the terms in the queries the same way as for the document collection.
Use the inverted index to retrieve document IDs that match each query using the implemented operators.
Save the results for each query in the required format.
Ensure the results are saved in files named results-cs.dat and results-en.dat for evaluation for the Czech and English results respectively.

Step 4: Evaluate Results

Evaluate your results using the provided relevance assessments separately for English and Czech.

For each query, compute the number of retrieved documents, the number of relevant documents retrieved, Precision, and Recall.
Average the values of all the scores across all queries to obtain the overall scores.
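The per-query scores and their averages could be computed along these lines. This is a sketch with hypothetical function names; treating precision as 0 when nothing is retrieved (and recall as 0 when nothing is relevant) is an assumption, since the assignment text does not fix a convention for empty sets.

```python
def precision_recall(retrieved, relevant):
    """Return (#retrieved, #relevant retrieved, precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # assumed convention
    recall = hits / len(relevant) if relevant else 0.0       # assumed convention
    return len(retrieved), hits, precision, recall


def average_scores(per_query):
    """Average each of the four values over all queries."""
    n = len(per_query)
    return tuple(sum(scores[k] for scores in per_query) / n for k in range(4))


q1 = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
q2 = precision_recall(["d5"], ["d5", "d6"])
print(q1, q2, average_scores([q1, q2]))
```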

Step 5: Submission

The submission will consist of a single IPython notebook (preferably a link to Google Colab) and a filled-in checklist.
Comment your code, report the results for each query and the overall scores (Precision, Recall).
Discuss the results and compare the performance of the system for both languages.
Submit using the Submission Form (include answers to all questions) by the given deadline.

3. Analysis of a Trained Model for Sentiment Classification

Deadline: 16th May 2025, 23:59 Submission form

In this assignment, you will analyze the weights of a trained neural network. In the practical following Lecture 7, you trained several classifiers for sentiment analysis. Your goal in this assignment will be to interpret the weights of one of the networks you trained in the practicals: Model 2 based on 1D convolution. If you did not manage to finish the model in the practical or you are unsure about your solution, you will receive a reference solution of the CNN-based model on April 30 via email from SIS (email the instructor if not).

The first step in the convolution is multiplying the word embeddings with a weight matrix to analyze the response of convolutional filters. The output of this multiplication can be considered as a measure of how strongly the embeddings match the weight vectors in the convolution, so-called filters. These are the values that you will work with.

Using the input word embeddings (you will likely find them in model.embeddings.weight) and the convolutional filter weights (likely in model.conv[0][1].weight), find the tokens that lead to the highest filter responses. The response is computed as a dot product of the respective word embeddings and vectors from the weight matrices (you will have to transpose the weights correctly; then you can find the best-scoring ones using the topk function, taking care to set the correct dimension). For simplicity, you can only work with kernels of size 1 but feel free to consider longer spans too. (The method tokenizer.convert_ids_to_tokens might be useful to convert the indices back to tokens.) [50% of the assignment]
Look at the results and qualitatively assess what words appear among the best-scoring ones. Write a few paragraphs of 100 to 400 words. [20% of the assignment]
Analyze what POS triggers the convolutional filters the most: compute a statistic of how often different POS appear among the best-scoring words. For each word, only consider the most frequent POS tag. (You can get the most frequent POS tags, e.g., from the English Web Treebank.) Speculate about the reasons for the statistics that you observe. Present your results in a table and write your thoughts and comments in at most 200 words. [30% of the assignment]
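The computation behind the first and third points can be illustrated on toy numbers in plain Python. This is only a sketch of the arithmetic for kernels of size 1: the vocabulary, the two-dimensional embeddings, the two filters and the POS dictionary are all made up, and the function names are hypothetical. In the notebook, the same result would more naturally come from a matrix product of the embedding and filter weight tensors followed by topk.

```python
import heapq
from collections import Counter


def filter_responses(embeddings, filters):
    """responses[f][v] is the dot product of filter f and the embedding of token v."""
    return [[sum(w * e for w, e in zip(filt, emb)) for emb in embeddings]
            for filt in filters]


def top_tokens(responses, vocab, k):
    """For each filter, the k tokens with the highest response, best first."""
    top = []
    for row in responses:
        best = heapq.nlargest(k, range(len(row)), key=row.__getitem__)
        top.append([(vocab[i], row[i]) for i in best])
    return top


def pos_histogram(top, most_frequent_pos):
    """Count how often each POS tag appears among the best-scoring tokens."""
    return Counter(most_frequent_pos.get(token, "UNK")
                   for row in top for token, _ in row)


# Toy stand-ins for the embedding matrix and the size-1 filter weights.
vocab = ["good", "bad", "the", "great"]
embeddings = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.1], [0.9, 0.2]]
filters = [[1.0, 0.0], [-1.0, 0.5]]
most_frequent_pos = {"good": "ADJ", "bad": "ADJ", "the": "DET", "great": "ADJ"}

top = top_tokens(filter_responses(embeddings, filters), vocab, k=2)
print(top)
print(pos_histogram(top, most_frequent_pos))
```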

Feel free to use the Colab Notebook from the practicals as a starting point. You can save the weights of your trained model into your Google Drive and load them from a file, so you do not have to train the model every time you work on the assignment.

Please write all your code in a replicable way into the notebook. Interleave the code with the text of your analysis and answer the questions. Write 1-2 paragraphs for each of the points in English, Czech or Slovak. Please submit a sharing link to your notebook via the following form.

Pool of possible exam questions

All variants of the final written exam tests will be assembled exclusively from questions selected from the following list:

(warning: the question list might be subject to occasional changes during the semester; the final version will be announced here no later than three weeks before the first exam date.)

Basic notions from probability and information theory.

What are the three basic properties of a probability function? (1 point)
When do we say that two events are (statistically) independent? (1 point)
Show how Bayes' Theorem can be derived. (1 point)
Explain Chain Rule. (1 point)
Explain the notion of Entropy (formula expected too). (1 point)
Explain Kullback-Leibler distance (formula expected too). (1 point)
Explain Mutual Information (formula expected too). (1 point)

Language models. The noisy channel model.

Explain the notion of The Noisy Channel. (1 point)
Explain the notion of the n-gram language model. (1 point)
Describe how Maximum Likelihood estimate of a trigram language model is computed. (2 points)
Why do we need smoothing (in language modelling)? (1 point)
Give at least two examples of smoothing methods. (2 points)

Morphological analysis.

What is a morphological tag? List at least five features that are often encoded in morphological tag sets. (1 point)
List the open and closed part-of-speech classes and explain the difference between open and closed classes. (1 point)
Explain the difference between a finite-state automaton and a finite-state transducer. Describe the algorithm of using a finite-state transducer to transform a surface string to a lexical string (pseudocode or source code in your favorite programming language). (2 points)
Give an example of a phonological or an orthographical change caused by morphological inflection (any natural language). Describe the rule that would take care of the change during analysis or generation. It is not required that you draw a transducer, although drawing a transducer is one of the possible ways of describing the rule. (1 point)
Give an example of a long-distance dependency in morphology (any natural language). How would you handle it in a morphological analyzer? (1 point)

Syntactic analysis.

Describe dependency trees, constituent trees, differences between them and phenomena that must be addressed when converting between them. (2 points)
Give an example of a sentence (in any natural language) that has at least two plausible, semantically different syntactic analyses (readings). Draw the corresponding dependency trees and explain the difference in meaning. Are there other additional readings that are less probable but still grammatically acceptable? (2 points)
What is coordination? Why is it difficult in dependency parsing? How would you capture coordination in a dependency structure? What are the advantages and disadvantages of your solution? (1 point)
What is ellipsis? Why is it difficult in parsing? Give examples of different kinds of ellipsis (any natural language). (1 point)

Information retrieval.

Explain the difference between information need and query. (1 point)
What is an inverted index and what are the optimal data structures for it? (1 point)
What is a stopword and what is it useful for? (1 point)
Explain the bag-of-words principle. (1 point)
What are the main advantage and disadvantage of the Boolean model? (1 point)
Explain the role of the two components in the TF-IDF weighting scheme. (1 point)
Explain length normalization in the vector space model. What is it useful for? (1 point)

Language data resources.

Explain what a corpus is. (1 point)
Explain what annotation is (in the context of language resources). What types of annotation do you know? (2 points)
What are the reasons for the variability of even basic types of annotation, such as the annotation of morphological categories (parts of speech etc.)? (1 point)
Explain what a treebank is. Why are trees used? (2 points)
Explain what a parallel corpus is. What kind of alignments can we distinguish? (2 points)
What is a sentiment-annotated corpus? How can it be used? (1 point)
What is a coreference-annotated corpus? (1 point)
Explain how WordNet is structured. (1 point)
Explain the difference between derivation and inflection. (1 point)

Evaluation measures in NLP.

Give at least two examples of situations in which measuring a percentage accuracy is not adequate. (1 point)
Explain: precision, recall (1 point)
What is the F-measure and what is it useful for? (1 point)
What is k-fold cross-validation? (1 point)
Explain BLEU (the exact formula not needed, just the main principles). (1 point)
Explain the purpose of brevity penalty in BLEU. (1 point)
What is Labeled Attachment Score (in parsing)? (1 point)
What is Word Error Rate (in speech recognition)? (1 point)
What is inter-annotator agreement? How can it be measured? (1 point)
What is Cohen's kappa? (1 point)

Deep learning for NLP.

Describe the two methods for training of the Word2Vec model. (1 point)
Use formulas to describe how Word2Vec is trained with negative sampling. (2 points)
Explain the difference between Word2Vec and FastText embeddings. (1 point)
Sketch the structure of the Transformer model. (2 points)
Why do we use positional encodings in the Transformer model? (1 point)
What are residual connections in neural networks? Why do we use them? (1 point)
Use formulas to express the loss function for training sequence labeling tasks. (1 point)
Explain the pre-training procedure of the BERT model. (2 points)
Explain what the pre-train and fine-tune paradigm in NLP is. (1 point)
Describe the task of named entity recognition (NER). Explain the intuition behind the CRF models compared to standard sequence labeling. (2 points)
Explain how self-attention differs in encoder-only and decoder-only models. (1 point)

Machine translation fundamentals.

Why is MT difficult from a linguistic point of view? Provide examples and explanation for at least three different phenomena. (2 points)
Why is MT difficult from a computational point of view? (1 point)
Briefly describe at least three methods of manual MT evaluation. (1-2 points)
Describe BLEU. 1 point for the core properties explained, 1 point for the (commented) formula.
Describe IBM Model 1 for word alignment, highlighting the EM structure of the algorithm. (1 point)
Explain, using equations, the relation between the noisy channel model and the log-linear model for classical statistical MT. (2 points)
Describe the loop of weight optimization for the log-linear model as used in phrase-based MT. (1 point)

Neural machine translation.

Describe the critical limitation of PBMT that NMT solves. Provide example training data and example input where PBMT is very likely to introduce an error. (1 point)
Use formulas to highlight the similarity of NMT and LMs. (1 point)
Describe how words are fed to current NMT architectures and explain why this is beneficial over the 1-hot representation. (1 point)
Sketch the structure of an encoder-decoder architecture of neural MT; remember to describe the components in the picture. (2 points)
What is the difference in RNN decoder application at training time vs. at runtime? (1 point)
What problem does attention in NMT address? Provide the key idea of the method. (1 point)
What problem/task do both RNN and self-attention resolve and what is the main benefit of self-attention over RNN? (1 point)
What are the three roles each state at a Transformer encoder layer takes in self-attention? (1 point)
What are the three uses of self-attention in the Transformer model? (1 point)
Provide an example of NMT improvement that was assumed to come from additional linguistic information but occurred also for a simpler reason. (1 point)
Summarize and compare the strategy of "classical statistical MT" vs. the strategy of neural approaches to MT. (1 point)

Homework assignments

There will be 3 homework assignments.
For each assignment, you will get points, up to a given maximum (the maximum is specified with each assignment).
All assignments will have a fixed deadline (usually in two weeks).
If you submit the assignment after the deadline, you will get:
- up to 50% of the maximum points if it is less than 2 weeks after the deadline;
- 0 points if it is more than 2 weeks after the deadline.
Once we check the submitted assignments, you will see the points you got and the comments from us in:
- Studijní mezivýsledky module in the Czech version of SIS
- Study group roster module in the English version of SIS
To be allowed to take the test (which is required to pass the course), you need to get at least 50% of the total points from the assignments.

Attendance at lectures is voluntary but recommended.
Attendance to practicals is mandatory. No more than three absences per semester will be allowed.

Exam test

There will be a written exam test at the end of the semester.
To pass the course, you need to get at least 50% of the total points from the test.
You can find a sample of test questions on the website; the list may be updated during the semester.

Grading

Your grade is based on the average of your performance; the exam test and the homework assignments are weighted 1:1.

≥ 90%: grade 1 (excellent)
≥ 70%: grade 2 (very good)
≥ 50%: grade 3 (good)
< 50%: grade 4 (fail)

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
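The grading rule can be written down as a short function. This is a sketch, not an official calculator: the name final_grade is hypothetical, exact fractions are used so that scores sitting precisely on a threshold are not affected by floating-point rounding, and returning grade 4 when either component is below 50% is an assumed reading of the homework and exam requirements stated above.

```python
from fractions import Fraction


def final_grade(hw_points, hw_max, test_points, test_max):
    """Return (overall performance, grade) with homework and test weighted 1:1."""
    hw = Fraction(hw_points, hw_max)
    test = Fraction(test_points, test_max)
    overall = (hw + test) / 2
    half = Fraction(1, 2)
    # Assumption: missing either 50% requirement means failing the course.
    if hw < half or test < half:
        return overall, 4
    if overall >= Fraction(9, 10):
        return overall, 1
    if overall >= Fraction(7, 10):
        return overall, 2
    return overall, 3


overall, grade = final_grade(600, 1000, 36, 40)
print(float(overall), grade)
```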

No cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
