The seminar focuses on a deeper understanding of selected unsupervised machine learning methods for students who already have basic knowledge of machine learning and probability theory. The course covers the following topics: Latent Variable Models, Categorical Mixture Models, Latent Dirichlet Allocation, Gaussian Mixture Models, Expectation-Maximization, Gibbs Sampling, Chinese Restaurant Process, Pitman-Yor Process, Hierarchical Clustering, Clustering Evaluation, Principal Component Analysis, t-SNE, Sparse Auto-Encoders.
SIS code: NPFL097
Semester: winter
E-credits: 3
Examination: 1/1 C
Guarantor: David Mareček
Students are expected to be familiar with basic probabilistic concepts, roughly to the extent of:
In the second half of the course, knowing the basics of deep-learning methods will be an advantage. I recommend attending
2. Beta-Bernoulli and Dirichlet-Categorical probabilistic models Beta-Bernoulli Dirichlet-Categorical
3. Modeling document collections, Mixture of Categoricals, Expectation-Maximization Document collections Mixture of Categoricals
4. Bayesian Mixture Models, Latent Dirichlet Allocation, Gibbs Sampling Latent Dirichlet Allocation Gibbs Sampling for LDA
5. Gibbs Sampling for LDA, Entropy, Assignment 1 Gibbs Sampling for LDA Latent Dirichlet Allocation Gibbs Sampling for Mixture of Gaussians
6. Chinese Restaurant Process, Unsupervised Text Segmentation Chinese Restaurant Process Bayesian Inference with Tears (by K. Knight) Text Segmentation
7. Gibbs Sampling method and Traditional NLP tasks Traditional_NLP_Tasks
8. K-Means clustering, Mixture of Gaussians K-Means and Gaussian Mixture Models
9. Agglomerative Clustering, Evaluation methods Agglomerative Clustering and Clustering Evaluation
10. Dimensionality Reduction Dimensionality Reduction t-SNE and PCA demo
12. Sparse Auto-Encoders, Interpretation of Neural Language Models
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
Oct 1
Oct 8
Oct 15
Oct 22
Oct 29
Nov 12
Nov 19 Traditional_NLP_Tasks
Nov 26
Dec 3
Dec 10
Dec 17
Jan 7
Deadline: Nov 19, 23:59 10 points
Deadline: Dec 10, 23:59 10 points
Define Beta distribution, describe its parameters. Plot (roughly) the following distributions: Beta(1,1), Beta(0.1,0.1), Beta(10, 10).
Derive the posterior distribution from the prior (Beta distribution) and likelihood (Binomial distribution). Derive the predictive distribution for the Beta-Bernoulli posterior.
Explain Dirichlet distribution, describe its parameters. Plot (roughly) the following distributions: Dir(1,1,1), Dir(0.1,0.1,0.1), Dir(10, 10, 10).
Derive the posterior distribution from the prior (Dirichlet distribution) and likelihood (Multinomial distribution). Derive the predictive distribution for the Dirichlet-Categorical posterior.
Explain the "Mixture of Categoricals" model (a topic is assigned to each document) for modeling document collections. Describe all its parameters and hyperparameters. From what distributions are they drawn? Describe the Expectation-Maximization algorithm for training such a model.
Explain the Latent Dirichlet Allocation model (a topic is assigned to each word in each document). Describe all its parameters and hyperparameters. From what distributions are they drawn? What are the latent variables? Describe the learning algorithm based on Gibbs sampling.
Explain Collapsed Gibbs sampling. Choose one unsupervised task from the lectures (word alignment, tagging, segmentation) and describe the basic algorithm. What is annealing?
Explain Chinese Restaurant Process. What distributions does it generate? What is exchangeability? Explain its generalization to the Pitman-Yor process.
Explain the Gaussian Mixture model for clustering. What are the advantages of the Gaussian Mixture model compared to K-means? Provide an example of clusters in 2D where K-means fails and where the Gaussian Mixture model works well.
Explain Hierarchical Agglomerative clustering methods. What are their advantages over K-means? What linkage criteria do you know? Provide examples of clusters in 2D where these criteria fail.
What is t-SNE? What properties does it have? What is it used for? How does it work?
What is Principal Component Analysis? What properties does it have? What is it used for? How does it work? Explain it in a 2D example.
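As a study aid for the Beta-Bernoulli questions above, the conjugate update can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the official course materials: the function names `beta_bernoulli_posterior` and `predictive_prob_one` and the toy data are assumptions made for this example. It relies on the standard result that a Beta(a, b) prior combined with observed 0/1 outcomes gives a Beta(a + ones, b + zeros) posterior, whose predictive probability of the next outcome being 1 is its mean.

```python
from fractions import Fraction


def beta_bernoulli_posterior(a, b, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes.

    Hypothetical helper for illustration: a and b are the prior
    pseudo-counts of ones and zeros, respectively.
    """
    ones = sum(observations)
    zeros = len(observations) - ones
    return a + ones, b + zeros


def predictive_prob_one(a, b):
    """Probability that the next outcome is 1 under a Beta(a, b) belief.

    This is the mean of the Beta distribution, a / (a + b).
    """
    return Fraction(a) / (Fraction(a) + Fraction(b))


# Toy data (an assumption for this example): three ones and one zero,
# starting from the uniform prior Beta(1, 1).
prior = (1, 1)
data = [1, 1, 0, 1]
posterior = beta_bernoulli_posterior(*prior, data)
print(posterior, predictive_prob_one(*posterior))  # (4, 2) 2/3
```

The same counting pattern carries over to the Dirichlet-Categorical case, where each category's pseudo-count is increased by the number of times that category was observed.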
Christopher Bishop: Pattern Recognition and Machine Learning, Springer-Verlag New York, 2006 (read here)
Kevin P. Murphy: Machine Learning: A Probabilistic Perspective, The MIT Press, Cambridge, Massachusetts, 2012 (read here)
David Mareček, Jindřich Libovický, Tomáš Musil, Rudolf Rosa, Tomasz Limisiewicz: Hidden in the Layers: Interpretation of Neural Networks for Natural Language Processing. Institute of Formal and Applied Linguistics, 2020 (read here)