The course focuses on practical exercises applying machine learning techniques to real data. Students are expected to be familiar with basic machine learning concepts.
SIS code:
NPFL104
Semester: summer
E-credits: 5
Examination: 1/2 credit+exam
Whenever you have a question or need some help (and Googling does not work), contact us as soon as possible! Please always e-mail both of us.
To pass the course, you will need to submit homework assignments and do a written test. See Grading for more details.
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
2. Selected classification techniques hw_my_classif
3. Classification dataset preparation hw_my_dataset
4. Selected classification techniques, cont.
6. Scikit-learn hw_scikit_classif
7. Diagnostics & kernel methods cont. Slides (Diagnostics, Kernels Illustrated) Extra slides (Ng) Extra slides (Cohen) Reading: Bias-Variance Proof hw_gridsearch
8. Regression hw_my_regression
9. Feature engineering, Regularization Slides (Feature engineering) hw_scikit_regression
10. Clustering hw_my_clustering
12. Overall hw evaluation, solution highlights
Legend:
Feb 20
To provide students with intensive practical experience in applying machine learning techniques to real data.
Until (time) exhausted, loop as follows:
Three blocks of classes corresponding to basic ML tasks...
... interleaved with classes on common topics such as
Let's borrow some googled materials:
Feb 27
We'll recall the task of classification, discuss a few selected basic classification techniques, and discuss their pros and cons:
Mar 6 Selecting topics for a homework task focused on creating a new classification dataset. hw_my_dataset
Mar 13
some more classification methods in detail:
multiclass classification
Mar 20
Mar 27
for the scikit-learn common estimator API see slide 47 in Introduction to Machine Learning with Scikitlearn
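A hedged sketch of that common estimator API (fit / predict / score), using the bundled iris data rather than any course dataset:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)                  # every estimator: learn from data
preds = clf.predict(X[:5])     # every estimator: predict labels for instances
print(clf.score(X, y))         # mean accuracy on the given data
```

Any other scikit-learn classifier could be dropped in place of KNeighborsClassifier without changing the three calls.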
April 3
Slides (Diagnostics, Kernels Illustrated) Extra slides (Ng) Extra slides (Cohen) Reading: Bias-Variance Proof
April 10
April 17
April 24
Clustering evaluation:
Choosing the number of clusters:
May 15
May 22
Create your git repository at UFAL's redmine; follow these instructions, just replace npfl092 with npfl104, and 2017 with 2019.
Continue practicing your knowledge of Python:
Implement at least 10 tasks from CodingBat (only from categories Warmup-2, List-2, or String-2).
Implement a simple class, anything with at least two methods and some data attributes (really anything).
For all tasks, add short testing code snippets to the respective source files (e.g. count_evens.py should contain your implementation of the count_evens function, as well as a short test checking the function's answer on at least one problem instance). Run them from a single Makefile: after typing 'make', we should see confirmations of correct functionality.
Submit the solutions into hw/python
in your git repository for this course.
Deadline: Mar 6 (23:59 Prague time, which applies for all deadlines below too)
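As a sketch of the testing-snippet requirement, a count_evens.py (one of the CodingBat List-2 tasks) might look like this:

```python
# count_evens.py -- CodingBat List-2 task: count the even ints in a list

def count_evens(nums):
    """Return the number of even ints in the given list."""
    return sum(1 for n in nums if n % 2 == 0)

if __name__ == "__main__":
    # short self-test; running this via 'make' should confirm correctness
    assert count_evens([2, 1, 2, 3, 4]) == 3
    assert count_evens([2, 2, 0]) == 3
    assert count_evens([1, 3, 5]) == 0
    print("count_evens OK")
```

The Makefile then just runs `python3 count_evens.py` (and the other scripts) so that `make` prints one confirmation per task.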
Homework my-classifiers
:
Do-It-Yourself style.
Implement three classifiers in Python and evaluate their performance on given datasets.
Choose three classification techniques out of the following four: perceptron, Naive Bayes, KNN, decision trees.
You can use any existing libraries, e.g. for data loading/storage or for other data manipulation (including e.g. one-hot conversions), with one exception: the machine learning core of each classifier (i.e., the training and prediction code) must be written solely by yourself.
Apply the three classifiers to the following datasets and measure their accuracy (= percentage of correctly predicted instances):
a synthetic dataset (colored objects), in a perfectly separable version and a version with added noise: artificial_objects.tgz
adult income prediction dataset from the Machine Learning Repository at UCI
Organize the execution of the experiments using a Makefile: 'make download' downloads the data files from the URLs given above; 'make perc' should run training and evaluation of the perceptron (if it is in your selection) on both datasets and print the final accuracy for each (other output goes to perc.log); 'make nb', 'make knn', and 'make dt' should work analogously (three of the four are enough); 'make all' should call the data download and all three of your classification targets.
After finishing the task, store it into hw/my-classifiers
in your git repository for this course.
Please double-check that 'make all' works in a fresh git clone in the default SU1 environment (you can access the SU1 computers remotely by ssh).
Deadline: Mar 13
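As an illustration of the do-it-yourself requirement, a minimal perceptron core might look like the sketch below (training and prediction written from scratch; numpy is used only for array handling, which the rules above allow):

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Binary perceptron; y must contain labels in {-1, +1}.
    Returns a weight vector with the bias folded in as the last weight."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (xi @ w) <= 0:              # misclassified -> update
                w += lr * yi * xi
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w >= 0, 1, -1)

def accuracy(y_true, y_pred):
    """Percentage of correctly predicted instances."""
    return 100.0 * float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

On the perfectly separable version of the synthetic dataset this should reach 100% accuracy; multiclass handling (e.g. one-vs-rest) for the other datasets is left to you.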
In your repository, create this structure of directories:
hw/my-dataset/ob-sample/ # one subdir per dataset variation
hw/my-dataset/ob-sample-with-derived-features/ # another variation
The name of your dataset should be the one listed in the ML Methods Datasets 2019 Google Sheet.
Within each dataset directory (ob-sample
or ob-sample-with-derived-features
in the example here), provide these files:
train.txt.gz
, test.txt.gz
[compulsory] the training and test data
Both files have to be gzipped plaintexts, comma-delimited, with unix line breaks.
The class label has to be in the last field of each line. (Use class labels relevant for your dataset, i.e. symbolic names such as Good, Bad. Avoid meaningless numbers.)
...this means that your dataset already has to be converted to reasonable features. If there are more sensible ways to featurize your dataset, feel free to create more dataset variations.
If your data is over ~7MB, please commit a Makefile, not the actual data. Running make
in your dataset directory should download/obtain these files from somewhere.
header.txt
README
corrplot.pdf
or histogram-of-labels.pdf
or similar
[compulsory] Choose one way of visualizing the data to give a quick overview of it.
Some examples on visualization in Python will be added in the coming days.
Deadline: Mar 25
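Until those examples appear, here is one hedged sketch of the histogram-of-labels visualization for data in the required format (the tiny inline dataset written at the top is purely illustrative; in the real homework the train.txt.gz already exists):

```python
import gzip
from collections import Counter
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt

# tiny illustrative train.txt.gz in the required format:
# comma-delimited, class label in the last field
rows = ["red,circle,Good", "blue,square,Bad", "red,square,Good"]
with gzip.open("train.txt.gz", "wt") as f:
    f.write("\n".join(rows) + "\n")

# count the class labels (last comma-delimited field of each line)
with gzip.open("train.txt.gz", "rt") as f:
    labels = [line.rstrip("\n").split(",")[-1] for line in f]
counts = Counter(labels)

plt.bar(list(counts.keys()), list(counts.values()))
plt.title("Distribution of class labels")
plt.savefig("histogram-of-labels.pdf")
print(counts)
```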
The "do-it-well" version of my-classifiers
homework
Apply scikit-learn classification modules to all the datasets collected by you and your colleagues (each student has to do this homework on their own, on all datasets).
The data for each dataset is in data/DATASET-NAME. You need to submit both your script that runs it and the scores that you achieved, into hw/scikit-classifiers.
Typing make in that directory will get the data and run all the classifiers.
Add your scores to CLASSIFICATION_RESULTS.txt in the shared repository.
Run check-classification-results.py < CLASSIFICATION_RESULTS.txt before pushing the file.
Deadline: April 24
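A minimal sketch of running several scikit-learn classifiers and printing accuracies (synthetic data stands in for the real train/test files, and the classifier choice is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for one dataset; the real homework loads the
# comma-delimited train.txt.gz / test.txt.gz files instead
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in [LogisticRegression(max_iter=1000),
            GaussianNB(),
            DecisionTreeClassifier(random_state=0)]:
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{clf.__class__.__name__}: {acc:.3f}")
```

Wrapping this in a loop over all dataset directories, and writing the printed scores into CLASSIFICATION_RESULTS.txt, is the part left to you.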
homework classification-gridsearch
Use the dataset PAMAP-Easy from the shared data repository:
For PAMAP-Easy as divided into train+test:
Use GridSearchCV to find the best C and gamma (i.e. find the best combination without plotting anything).
DO NOT USE test.txt.gz FOR THE GRIDSEARCH.
Add the accuracy on test.txt.gz to CLASSIFICATION_RESULTS.txt, and mention C and gamma in the comment.
Deadline: April 24
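A hedged sketch of the gridsearch step (synthetic data stands in for the PAMAP-Easy training set, and the C/gamma grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic stand-in for the PAMAP-Easy training data; the real
# homework reads train.txt.gz, and test.txt.gz stays out of the search
X_train, y_train = make_classification(n_samples=300, n_features=8,
                                       random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)   # cross-validates on training data only
print("best parameters:", search.best_params_)
```

Only after the search is done should `search.best_estimator_` be evaluated once on the held-out test set.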
homework my-regression
do-it-yourself style task: implement any regression technique (e.g. least squares by stochastic gradient descent) and apply it on the following datasets:
artificial data y(x) = 2*x + N(0,1): artificial_2x_train.tsv artificial_2x_test.tsv
prices of flats in Prague, given their area (m2), type of construction, type of ownership, status, floor, equipment, cellar, and balcony: pragueestateprices_train.tsv pragueestateprices_test.tsv
organize the execution of the experiment using a Makefile, typing make all
should train and evaluate (e.g. via mean square error) models for both datasets
store your solution into hw/my-regression
directory in the git repository
Deadline: May 1
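One possible shape of the do-it-yourself solution, sketching least squares via stochastic gradient descent; the inline synthetic y = 2*x + noise data stands in for the .tsv files listed above:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100):
    """Least squares fit via stochastic gradient descent (bias folded in)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            w -= lr * (xi @ w - yi) * xi        # gradient of 0.5 * (pred - y)^2
    return w

def mse(w, X, y):
    """Mean square error of the linear model w on (X, y)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w - y) ** 2))

# quick check on data shaped like the artificial_2x set: y = 2*x + N(0,1)
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, 200)
w = sgd_linear_regression(X, y)
print("learned slope:", w[0], "MSE:", mse(w, X, y))
```

The learned slope should land close to 2; on the Prague flats data the categorical columns first need a numeric (e.g. one-hot) encoding.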
The Do-it-well step
Apply at least three scikit-learn regression modules on the datasets from the previous class on regression. You can use e.g. modules for Generalized Linear Models, Support Vector regressors, KNN regressors, Decision Tree regressors, or any other.
make your solution (i.e. training and evaluation (e.g. via mean square error) on both datasets) runnable just by typing 'make'
submit your solution into hw/scikit-regression/ in your git repository.
Deadline: May 19
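A minimal sketch with three scikit-learn regressors evaluated via mean square error (synthetic data stands in for the course datasets; the regressor choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# synthetic stand-in shaped like the artificial_2x data (y = 2*x + noise);
# the real homework reads the .tsv files from the previous class
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, 200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

for reg in [LinearRegression(),
            KNeighborsRegressor(),
            DecisionTreeRegressor(random_state=0)]:
    reg.fit(X_tr, y_tr)
    err = mean_squared_error(y_te, reg.predict(X_te))
    print(f"{reg.__class__.__name__}: MSE = {err:.3f}")
```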
Homework my-clustering
Do-It-Yourself style
implement the K-Means algorithm in a Python script that reads data in the PAMAP-easy format and clusters the data. At the end, the script should print a summary table like the one in this example.
Dataset for the homework: PAMAP-Easy from the shared data repository:
Commit your script and the Makefile into the my-clustering/
directory in the usual place.
As usual, please double-check that typing make
in this directory in a fresh clone works in the default SU1 environment.
Deadline: May 8
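A hedged sketch of the K-Means core (the optional init parameter is illustrative; the real script still has to read the PAMAP-easy format and print the summary table):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0, init=None):
    """Plain K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # start from k distinct random points unless initial centroids are given
    centroids = init if init is not None \
        else X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

The summary table then amounts to cross-tabulating the returned cluster labels against the gold activity labels of the dataset.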
Describe the task of classification. (1 point)
Describe linear classification models. (1 point)
Explain how perceptron classifiers work. (2 points)
Explain how Naive Bayes works. (2 points)
Explain how KNN works. How would you choose the value of K? (3 points)
Explain how decision trees work. Give examples of stopping criteria. (3 points)
Explain how SVM works. (3 points)
Compare pros and cons of any 2 classification methods which you choose out of the following 6 methods: perceptron, Naive Bayes, KNN, decision trees, SVM, Logistic Regression (3 points)
Explain the notion of separation boundary. (1 point)
What does it mean that a classification dataset is linearly separable? (1 point)
Assume there are two classes of points in 2D in the training data: A = {(0,0),(1,1)} and B = {(3,1),(2,3),(2,4)}. Could you sketch separation boundaries that would be found by (a) SVM, (b) perceptron, (c) 1-NN? (3 points)
How would you use a binary classifier for a multiclass classification task? (2 points)
Give two examples of "native multiclass" classifiers and two examples of "native binary" classifiers. (3 points)
What would you do if you are supposed to solve a classification task whose separation boundary is clearly non-linear (e.g. you know that it's ball-shaped)? (2 points)
Your grade is based on the average of your performance on the written test and the homework assignments, weighted 1:1.
For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
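That arithmetic, spelled out as a quick sketch:

```python
# grading sketch: homework and test weighted 1:1, as stated above
hw_score = 600 / 1000     # 60 % on homework
test_score = 36 / 40      # 90 % on the test
total = (hw_score + test_score) / 2
print(f"total performance: {total:.0%}")   # -> 75%
```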