The aim of the course is to get students familiar with basic software tools used in natural language processing. The classes combine lectures and practicals.
SIS code:
NPFL125
Semester: winter
E-credits: 3
Examination: 0/2 MC (KZ)
Whenever you have a question or need some help (and Googling does not work), contact us as soon as possible! Please always e-mail both of us.
To pass the course, you will need to submit homework assignments and do a written test. See Grading for more details.
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
1. Introduction; Survival in Linux, Bash UNIX (Czech) Bash (English) hw_ssh Questions
2. Editors, Git, Recodex A Visual Introduction to Git git-scm: About Version Control tryGit Python in Atom Getting Started with VS Code Branching hw_git hw_hello Questions
3. Bash and Makefiles Unix for Poets hw_makefile Questions
4. Encoding Encoding Questions
5. Finishing up encodings; Python, basic manipulation with strings Python: Introduction for Absolute Beginners hw_python Questions
6. Python: strings cont., I/O basics, regular expressions; modules, packages, classes Strings Python tutorial Unicode Text files Regexes Reading hw_string hw_tagger Questions Questions
7. A gentle introduction to XML and JSON XML XML&JSON hw_xml2json Questions
8. Spacy, NLTK and other NLP frameworks hw_frameworks Questions
9. REST API hw_rest Building REST APIs Questions
10. Data processing, pandas V4PY online course
11. Visualization in Matplotlib
Oct 1, 2024
Introduction
keyboard shortcuts in KDE/GNOME, selected e.g. from here
motivation for scripting, command line features (completion, history...), keyboard shortcuts
bash in a nutshell
bit.ly/2hQQeTH
remote access to a unix machine: SSH (Secure Shell)
you can access a lab computer e.g. by opening a unix terminal and typing:
ssh yourlogin@u-pl1.ms.mff.cuni.cz
Replace yourlogin with your login into the lab and type your SIS password when asked for it. You can find your login in SIS as shown in the image below (note that you need to lowercase your login for SSH):
as of October 2024, you can access the u-pl0 and u-pl1 machines, and at night also u1-1 to u1-15 and u2-1 to u2-25 (or something like that); but some machines might be offline or rebooted to Windows and thus inaccessible, so it pays off to try out multiple machines
your home is shared across all the lab computers in all the MS labs (SU1, SU2, Rotunda), i.e. you will see your files everywhere
you can ssh even from non-unix machines
on Windows, open the command prompt (press Windows+R, type cmd, press Enter) and a commandline opens up in which you can use ssh directly
on Windows, you can also try using the Windows Terminal
Oct 8, 2024 A Visual Introduction to Git git-scm: About Version Control tryGit
Python in Atom Getting Started with VS Code
requirements on a modern source-code editor
fallback mode for working in a text console OR a different editor to use in the text console
you can use any editor you like, as long as it has the capabilities listed above and you know how to use them
if you don't have a favourite Linux editor yet, we suggest e.g. Atom or Visual Studio Code; both Atom and VS Code are installed in the labs, and are cross-platform, i.e. you can also use them on Windows and Mac
for a text-mode editor (without a graphical user interface, e.g. for working through ssh), we suggest nano
other good editors include e.g. Sublime (cross-platform); for Windows, e.g. Notepad++ and PSPad are good
for using emacs (if you really want to): look here
for using vim (if you really want to): run the vimtutor command to go through an introductory tutorial of using vim (vimtutor english to run the English version of the tutorial)
hw_hello
/afs/ms/doc/vyuka/INCOMING/TechnoNLP/
Note: This "first setup" section and the concrete URLs are specific to the faculty GitLab and to our course. However, you would use a similar approach for your own project hosted either at the faculty GitLab or at another similar service which hosts Git repositories, such as GitHub, public GitLab, or BitBucket.
create a new project, called e.g. NPFL125 (with the project slug npfl125), and set its visibility to Private
your repository URL will look like https://gitlab.mff.cuni.cz/yourlogin/yourprojectslug.git (e.g. https://gitlab.mff.cuni.cz/rosar7am/npfl125.git)
add Rudolf Rosa as a member with the Reporter role (this is for read-only access), and do the same for Zdeněk Žabokrtský
Note: the rest of the instructions is generally valid for working with any Git anywhere.
cd
git clone https://gitlab.mff.cuni.cz/yourlogin/npfl125.git (or whatever your repo URL is)
cd npfl125 (or whatever your repo ID is)
create a goodbye.sh file:
echo 'echo Goodbye cruel world' > goodbye.sh
git status
git add goodbye.sh
git status
git commit -m'Goodbye, all you people'
git status
git push
git status
cd
mkdir npfl125
cd npfl125
git init
create a goodbye.sh file:
echo 'echo Goodbye cruel world' > goodbye.sh
git status
git add goodbye.sh
git status
git commit -m'Goodbye, all you people'
git status
git remote add origin https://gitlab.mff.cuni.cz/yourlogin/npfl125.git (or whatever your repo URL is)
git push -u origin master
git status
cd; mkdir new_clone_of_repo; cd new_clone_of_repo
git clone https://gitlab.mff.cuni.cz/yourlogin/npfl125.git (or whatever your repo URL is)
cd npfl125
echo 'This repo will contain my homework.' >> README
git add README
git commit -m'adding more info'
git push
cd ~/npfl125
cat README
git pull
cat README
git clone (if on a different computer)
git pull
git status (and git diff to see changes inside files)
git add (untracked files are ignored by git)
git commit
git push
git checkout filename to revert to the last committed version of file filename
git log to figure out which commit you are interested in
git show commitid to show the details about a commit with id commitid
git checkout commitid to go to the state after the commit commitid
git checkout master to return to the current state
git branch branchname to create a new branch called branchname
git checkout branchname to switch to the branch branchname
git checkout master to switch back to master
git merge branchname to merge branch branchname into the current branch
git checkout master
git merge branchname
git branch -d branchname to remove the branch called branchname
(for copying files to/from remote machines, you can use scp)
Oct 15, 2024
exercise: maybe again playing with text files udhr.zip, also available for download at bit.ly/2hQQeTH; or any larger UTF8 text, e.g. from Project Gutenberg
.bashrc
Bash scripting
text processing commands: sort, uniq, cat, cut, [e]grep, sed, rev, diff, pipelines, man...
regular expressions
if, while, for
variables (a=abcde; echo ${a:1:2})
xargs: Compare
sed 's/:/\n/g' <<< "$PATH" | \
grep $USER | \
while read path ; do
ls $path
done
with
sed 's/:/\n/g' <<< "$PATH" | \
grep $USER | \
xargs ls
Makefiles
make at Wikipedia
Makefile tutorial
very simple Makefile sample (from the lesson): Makefile
variables: a=5, name=Rudolf, echo $a, echo ${name} $(name)
a:=$b is expanded when you define it; b=$a is expanded when you use it
SHELL=/bin/bash unless you really want to use pure sh ($$a to refer to Bash variable a)
target dependencies (print_file_a first invokes create_file_a):
create_file_a:
echo Hello > file_a
print_file_a: create_file_a
cat file_a
targets as recipes for creating files (the target name corresponds to the name of the file being created); the code is not run if the file already exists and is up to date:
file_a:
echo Hello > file_a
or with Makefile automatic variables ($@ corresponds to the name of the current target):
file_a:
echo Hello > $@
target dependencies done in a clever way (print_file_a depends on the existence of file_a; if it already exists, it is just printed, if it does not exist, it is first created by first invoking the target file_a and then printed):
file_a:
echo Hello > file_a
print_file_a: file_a
cat file_a
or with Makefile automatic variables ($< is the first prerequisite):
file_a:
echo Hello > $@
print_file_a: file_a
cat $<
warm-up exercises:
reverse each line of a file (hint: rev reverts individual lines)
extract the last three letters of each line (hint: sed 's/./&\t/g' | rev | cut -f2,3,4 | rev)
system variables
editing .bashrc (aliases, paths...)
looping, branching, e.g.
#!/bin/bash
for file in *; do
    if [ -x "$file" ]
    then
        echo "Executable file: $file"
        echo "Shebang line: $(head -n 1 "$file")"
        echo
    fi
done
leftover from previous class (web page encoding)
Coding examples shown during class: some Bash and Makefile stuff from 2019: bash history
Some more exercises (in all cases, expect a file input.txt containing an English text as the input):
use grep to count all occurrences of a specific word (i.e., the number of all occurrences, not the number of lines containing the word)
use grep to print line numbers in which a specific word appears.
use sed to replace all occurrences of a specific word with its capitalized form.
use sed to make all occurrences of a specific word bold using the HTML markup, or using markdown.
using sed, remove all punctuation from the text.
Oct 22, 2024
Oct 29, 2024 Python: Introduction for Absolute Beginners
Go to Google Colab, which is a web service where you can directly write and run Python code
Alternatively, go e.g. to Kaggle
Click "New notebook" or "File > New notebook" to create a new interactive Python session
You will see an empty text field; this is a code field
Hello world
Type print("Hello world!") into the code field
Press Shift+Enter
You should see the output, Hello world, printed below the code field
Basic mathematics
print(20+3)
print(20*3)
print(20/3)
, print(20//3)
, print(20-3)
Fun with strings (a string is a piece of text)
print("I like " + "apples")
print(10 * "apple")
A multiline code
a = 5
b = 10
c = a + b
print(c)
Now try to write something yourself
create a variable and store a number in it (you can call it e.g. a or mynumber or big_fat_elephant)
create two variables containing strings (e.g. "Hello" and "world"), add them together and print out the result
note that "Hello" or 'Hello' is a string, while Hello could be a variable name; use hello = "Hello" to store the string Hello into a variable called hello (you can use Ctrl+F to search for stuff)
store your name and a friend's name into variables (e.g. me = "Rudolf" and you for the other name) and print out a greeting (e.g. print("Hello " + me + "Hi " + you))
To solve practical tasks, Google is your friend!
By default, we will use Python version 3: python3
To create a Python script (needed e.g. to submit homework assignments):
Create a PY textfile in your favourite editor (e.g. myscript.py
)
Put a correct Python 3 she-bang on the first line (so that Bash knows to run the file as a Python script), and your code on subsequent lines, so the file may look e.g. like this:
#!/usr/bin/env python3
print("Hello world")
Save the script
Make it executable: chmod u+x myscript.py
Run it (in the terminal): ./myscript.py
or python3 myscript.py
To work interactively with Python, you can use Google Colab
run a code field with the > button or Ctrl+Enter or Shift+Enter (recommended: runs the code and creates a new code field)
5+5 prints out 10 in interactive Python
in a script, you need the print() function: 5+5 does "nothing" in a Python script, print(5+5) prints 10
For offline interactive working with Python in the terminal, you can simply run python3
and start typing commands
A slightly more friendly version is IPython: ipython3
to save the commands 5-10 from your IPython session to a file named mysession.py, run:
%save mysession 5-10
to exit IPython, run:
exit
To install missing modules (maybe ipython might be missing), use pip (in Bash):
pip3 install --user ipython
For non-interactive work, use your favourite text editor.
Python types
a = 1
a = 1.0
a = True
a = '1 2 3'
or a = "1 2 3"
a = [1, 2, 3]
a = {"a": 1, "b": 2, "c": 3}
a = (1, 2, 3) (something like a fixed-length immutable list)
Create a string containing the first chapter of genesis. Print out the first 40 characters.
str[from:to] # from is inclusive, to is exclusive
Print out 4th to 6th character 1-based (=3rd to 5th 0-based)
Check the length of the result using len().
Split the string into tokens (use str.split(); see ?str.split for help).
Print out first 10 tokens. (List splice behaves similarly to substring.)
Print out last 10 tokens.
Print out 11th to 18th token.
Check the length of the result using len().
Just printing a list splice is fine; also see ?str.join
Compute the unigram counts into a dictionary.
# Built-in dict (need to explicitly initialize keys):
unigrams = {}
# The Python way is to use the foreach-style loops;
# and horizontal formatting matters!
for token in tokens:
    # do something
# defaultdict, supports autoinitialization:
from collections import defaultdict
# int = values for non-set keys initialized to 0:
unigrams = defaultdict(int)
# Even easier:
from collections import Counter
Print out most frequent unigram.
max(something)
max(something, key=function_to_get_key)
# getting value stored under a key in a dict:
unigrams[key]
unigrams.get(key)
Or use Counter.most_common()
Print out the unigrams sorted by count.
Use sorted()
— behaves similarly to max()
Or use Counter.most_common()
Get unigrams with count > 5; can be done with list comprehension:
[token for token in unigrams if unigrams[token] > 5]
Count bigrams in the text into a dict of Counters
bigrams = defaultdict(Counter)
bigrams[first][second] += 1
For each unigram with count > 5, print it together with its most frequent successor.
[(token, something) for …]
Print the successor together with its relative frequency rounded to 2 decimal digits.
max(), sum(), dict.values(), round(number, ndigits)
Print a random token. Print a random unigram disregarding their distribution.
import random
?random.choice
list(dict.keys())
Pick a random word, generate a string of 20 words by always picking the most frequent follower.
range(10)
Put that into a function, with the number of words to be generated as a parameter.
Return the result in a list.
list.append(item)
def function_name(parameter_name=default):
    # do something
    return 123
Sample the next word according to the bigram distribution
import numpy as np
?np.random.choice
np.random.choice(list, p=list_of_probs)
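A minimal sketch of these last generation steps, assuming tokens, unigrams and bigrams have been built as above (bigrams as a defaultdict of Counters); the function names are just illustrative:
import random
import numpy as np

def generate(bigrams, length=20):
    # start from a random word, then always pick the most frequent follower;
    # assumes every word we reach has at least one recorded follower
    word = random.choice(list(bigrams.keys()))
    result = [word]
    for _ in range(length - 1):
        word = bigrams[word].most_common(1)[0][0]
        result.append(word)
    return result

def generate_sampled(bigrams, length=20):
    # same, but sample the next word according to the bigram distribution
    word = random.choice(list(bigrams.keys()))
    result = [word]
    for _ in range(length - 1):
        followers = list(bigrams[word].keys())
        counts = np.array(list(bigrams[word].values()), dtype=float)
        word = np.random.choice(followers, p=counts / counts.sum())
        result.append(word)
    return result

print(' '.join(generate(bigrams)))
print(' '.join(generate_sampled(bigrams)))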
Everything I showed interactively in the class in 2020 can be found in python_intro.ipynb
Commands from the interactive session from 2019: first_python_exercises.py
A sample solution to the exercises 1 to 13 can be found in solution_1.py
Nov 12, 2024
Individual preparation before the class (45 minutes at most):
Python strings resemble lists in some aspects, for instance we can access individual characters using their positional indices and the bracket notation...
greeting = "hello"
print(greeting[0])
greeting[0] = "H"
... wait, the last line causes an error! Why is that?
If the distinction mutable vs. immutable is new to you, please read e.g. Mutability & Immutability in Python by Chetan Ambi. Please be ready to explain the distinction at the beginning of the class.
If you know the distinction already, explain why a repeated string concatenation like the following one is a bad idea in Python
s = ''
for i in range(n):
    s = str(i) + s
Ideally explain it in terms of the big O notation.
How would you handle similar situations in which repeated string accumulation is needed?
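For reference, a minimal sketch of the usual remedy: collect the pieces in a list and join them once at the end, which keeps the accumulation roughly linear instead of quadratic:
n = 100000

# quadratic: every concatenation copies the whole string built so far
s = ''
for i in range(n):
    s = str(i) + s

# linear: collect the pieces first, join them once at the end
pieces = []
for i in range(n):
    pieces.append(str(i))
s2 = ''.join(reversed(pieces))

assert s == s2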
str.*: useful methods you can invoke on a string
case changing (lower, upper, capitalize, title, swapcase)
is* tests (isupper, isalnum, ...)
searching and replacing (find, startswith, endswith, count, replace)
splitting and joining (split, splitlines, join)
dir, sorted, set
list comprehension
a very pythonic way of creating lists using functional programming concepts:
words = [word.capitalize() for word in text if len(word) > 3]
equivalent to:
words = []
for word in text:
    if len(word) > 3:
        words.append(word.capitalize())
reading in data
opening a file using its name
fh = open('file.txt')
text = fh.read()
lines = fh.readlines()
for line in fh: print(line.rstrip())
read from standard input (cat file.txt | ./process.py or ./process.py < file.txt)
import sys
for line in sys.stdin:
    print(line, end='')
Python has built-in regex support in the re
module, but the regex
module seems to be more powerful while using the same API. To be able
to use it, you need to:
install it (in Bash):
pip3 install --user regex
import it (in Python)
import regex as re
search
, findall
, sub
raw strings r'...'
character classes [[:alnum:]], \w
, ...
flags flags=re.I
or r'(?i)...'
subexpressions r'(.) (...)'
+ backreferences r'\1 \2'
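A small sketch putting these pieces together, assuming the regex module is installed as described above (the example text is arbitrary):
import regex as re   # drop-in replacement for the built-in re module

text = "In the beginning God created the heaven and the earth."

m = re.search(r'(\w+) (\w+)', text)        # raw string, two subexpressions
print(m.group(1), m.group(2))              # -> In the
print(re.findall(r'[[:alpha:]]+', text))   # POSIX character class (regex module)
print(re.findall(r'in', text, flags=re.I)) # case-insensitive flag, also matches "In"
print(re.sub(r'(\w+) and the (\w+)', r'\2 and the \1', text))  # \1, \2 backreferences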
revision of regexes
^[abc]*|^[.+-]?[a-f]+[^012[:alpha:]]{3,5}(up|down)c{,5}$
good text to play with: the first chapter of genesis again
unfiltered ipython3 session on RE from 2024: regex_2024_11_21.py
unfiltered ipython3 session on strings, I/O and RE from the zoom online class 2020: stringsession.py
regex ipython3 session from the lab (unfiltered, from a lab taught in year 2016)
Python types
a = 1
a = 1.0
a = True
a = '1 2 3'
or a = "1 2 3"
a = [1, 2, 3]
a = {"a": 1, "b": 2, "c": 3}
a = (1, 2, 3) (something like a fixed-length immutable list)
Create a string containing the first chapter of genesis.
Split the string into tokens (use str.split(); see ?str.split for help).
Compute the unigram counts into a dictionary.
# Built-in dict (need to explicitly initialize keys):
unigrams = {}
# The Python way is to use the foreach-style loops;
# and horizontal formatting matters!
for token in tokens:
    # do something
# defaultdict, supports autoinitialization:
from collections import defaultdict
# int = values for non-set keys initialized to 0:
unigrams = defaultdict(int)
# Even easier:
from collections import Counter
Print out most frequent unigram.
max(something)
max(something, key=function_to_get_key)
# getting value stored under a key in a dict:
unigrams[key]
unigrams.get(key)
Or use Counter.most_common()
Before the class: think up how we would deal with words
Classes in Python
class Word:, w = Word(), w.form = "help", def foo(self, x, y):, self.form, a.foo(x, y)
def __init__(self, x, y), def __str__(self), from Module import Class, if __name__ == "__main__":
a module is just a .py file; you can just import the module, or even import specific classes from the module; import MyModule as SomeOtherName
inheritance: class B(A); overriding is the default, just redefine the method; use super().foo() to invoke the parent's implementation
static members (no self, belong to the class) -- feel free to ignore this and just use non-static members only, mostly this is fine... (class A, a = 5, A.a = 10, def b(x, y), A.b(x, y))
packages: packA/modA.py, packA/modB.py, from packA.modB import classC ...
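A minimal sketch tying these notions together (the module and class names are just illustrative):
# word.py -- an illustrative module name
class Word:
    """A word form together with its lemma."""

    def __init__(self, form, lemma):
        self.form = form
        self.lemma = lemma

    def __str__(self):
        return self.form + "/" + self.lemma

class UppercaseWord(Word):
    # inheritance: just redefine a method to override it;
    # super() gives access to the parent's implementation
    def __str__(self):
        return super().__str__().upper()

if __name__ == "__main__":
    # runs only when the file is executed directly,
    # not when it is imported with e.g. `from word import Word`
    w = Word("help", "help")
    print(w)
    print(UppercaseWord("dogs", "dog"))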
Pickle: simple storing of objects into files (and then loading them again)
Python has a simple mechanism of storing any object (list, dict, dict of lists, any object you defined, or really nearly anthing) into special binary files.
To store an object (e.g. the list my_list), use pickle.dump():
my_list = ['hello', 'world', 'how', 'are', 'you?']
import pickle
# Need to open the file for writing in binary mode
with open('a_list.pickle', 'wb') as pickle_file:
    # Store the my_list object into the 'a_list.pickle' file
    pickle.dump(my_list, pickle_file)
A file a_list.pickle gets created with some unreadable binary data (next week, we get to ways of storing data in a more readable way).
However, for Python, the data is perfectly readable, so you can easily load your object like this (i.e. you can put this code into another Python script and run it like next day when you need to get back your list):
import pickle
# This time need to open the file for reading in binary mode
with open('a_list.pickle', 'rb') as the_file:
    the_list = pickle.load(the_file)
# And now you have the list back!
print(the_list)
print(the_list[3])
Virtual environments
Once for each project, create a venv for the project; specify any path you like to store the environment:
python3 -m venv ~/venv_proj_A
Every time you start working on project A, switch to the right venv:
source ~/venv_proj_A/bin/activate
Checking that everything looks fine:
your prompt should now start with (venv_proj_A)
python and python3 should now be local just for this venv: which python and which python3 should say something like /home/rosa/venv_proj_A/bin/python3
pip (and pip3 should be identical): which pip should say something like /home/rosa/venv_proj_A/bin/pip, and pip --version should mention python 3
To install Python packages just for this project: pip install package_name (instead of the usual pip3 install --user package_name)
To get out of the venv: deactivate
Exercise: implement a simple Czech POS tagger in Python, choose any approach you want, required precision at least 50%
Tagger input format - data encoded in iso-8859-2 in a simple line-oriented plain-text format: empty line separate sentences, non-empty lines contain word forms in the first column and simplified (one-letter) POS tag in the second column, such as N for nouns or A for adjectives (you can look at tagset documentation). Columns are separated by tabs.
Tagger output format: empty lines not changed, nonempty lines enriched with a third column containing the predicted POS for each line
Training data: tagger-devel.tsv
Evaluation data: tagger-eval.tsv (to be used only for evaluation!!!)
Performance evaluation (precision=correct/total): eval-tagger.sh
cat tagger-eval.tsv | ./my_tagger.py | ./eval-tagger.sh
Example baseline solution - everything is a noun, precision 34%:
python3 -c 'import sys; sys.stdin.reconfigure(encoding="iso-8859-2"); sys.stdout.reconfigure(encoding="iso-8859-2"); print("".join(s if s < "\r" else s[:-1] + "\tN\n" for s in sys.stdin), end="")' < tagger-eval.tsv | ./eval-tagger.sh
prec=897/2618=0.342627960275019
a simple rule: use Unicode everywhere, and if conversions from other encodings are needed, then do them as close to the physical data as possible (i.e., encoding should be processed properly already in the data reading/writing phase, and not internally by decoding the content of variables)
example:
f = open(fname, encoding="latin-1")
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdout.reconfigure(encoding='utf-8')
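For illustration, a minimal sketch of one possible (not required) approach to the tagger exercise: memorize the most frequent tag of each word form from tagger-devel.tsv, fall back to N for unknown forms, and handle the ISO-8859-2 encoding at the I/O boundary as recommended above:
#!/usr/bin/env python3
# most-frequent-tag baseline: learn form->tag counts from the training data,
# tag unknown forms as N; read and write ISO-8859-2 at the I/O boundary
import sys
from collections import defaultdict, Counter

counts = defaultdict(Counter)
with open('tagger-devel.tsv', encoding='iso-8859-2') as train:
    for line in train:
        parts = line.rstrip('\n').split('\t')
        if len(parts) >= 2:
            counts[parts[0]][parts[1]] += 1

sys.stdin.reconfigure(encoding='iso-8859-2')
sys.stdout.reconfigure(encoding='iso-8859-2')
for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        print()
    else:
        form = line.split('\t')[0]
        tag = counts[form].most_common(1)[0][0] if form in counts else 'N'
        print(line + '\t' + tag)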
Regexes Reading hw_string hw_tagger Questions Questions
Nov 19, 2024 XML
Motivation for XML, basics of XML syntax, examples, well-formedness/validity, dtd, xmllint
Optional exercise before the class (45 minutes at most)
Let's have a look at what can go wrong in XML files (in the sense of violating XML syntax).
Download the collection of toy examples of correct and incorrect XML files: xml-samples.zip
If you are familiar with XML already, then
If XML is new to you, then
Optional exercise: from HTML to XML:
Optional exercise (45 minutes at most)
wget http://ufal.mff.cuni.cz/courses/npfl092
xmllint --noout npfl092
Try to fix as many violations of the rules in the HTML file as you can in the given time, by any means (either manually or e.g. by Python regular expressions). Can you turn the HTML file to a completely well-formed XML file?
In order to avoid any confusion: this is just an exercise, XML and HTML are only cousins, and HTML files are usually not required nor expected to be well-formed XML files; HTML validity can be checked by some other tools such as by the W3C Markup Validation Service. You can check the HTML-validity of the course web page too if time remains.
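If you prefer to check well-formedness from Python rather than with xmllint, a minimal sketch using the standard library (the file name is just an example):
import xml.etree.ElementTree as ET

try:
    ET.parse('example.xml')           # parsing succeeds only for well-formed XML
    print('well-formed')
except ET.ParseError as error:
    print('not well-formed:', error)  # reports the first violation and its position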
We'll briefly discuss Markdown, which is a markup language too. You can play with an online Markdown-to-HTML converter.
Optional reading:
Nov 29, 2023
It seems to be a good idea to first create a virtual environment for that and activate it, e.g. (in Bash):
python3 -mvenv venv_spacy
source venv_spacy/bin/activate
Install Spacy -- in Bash (use install --user
if not in a virtual
environment):
pip install spacy
Install Spacy English model -- in Bash:
python -m spacy download en_core_web_sm
# or: pip install en_core_web_sm
Optionally, also install Spacy models for some other language(s) of your interest. Some 25 languages are directly available within Spacy: https://spacy.io/usage/models Often there are multiple models of multiple sizes.
Install NLTK -- in Bash (again, maybe better to use a separate venv; then there is no need for --user):
pip3 install --user nltk
Install NLTK data and models -- in Python:
import nltk
nltk.download()
# usually, you should choose to download "all" (but it may get stuck)
I have not tested everything on Google Colab. Spacy seems to be installed including at least some of the models. NLTK seems to be installed without models and data, so these have to be downloaded. Nevertheless, please also try to install everything on your machine if possible; and definitely on the remote lab machine so that you can test stuff there.
How is it better than other options, i.e. manual implementation or using existing standalone tools? (Note: the benefits of using a framework listed below are not necessarily true for all frameworks.)
In Bash (install Spacy and English model):
pip3 install --user spacy
python3 -m spacy download --user en_core_web_sm
# or: pip3 install --user en_core_web_sm
All officially available models: https://spacy.io/usage/models Sometimes there are multiple models of multiple sizes. For other languages, you have to find or create a model.
In Python (import spacy and load the English model):
import spacy
nlp = spacy.load("en_core_web_sm")
Create a new document:
doc = nlp("The duck-billed platypus (Ornithorhynchus anatinus) is a small mammal of the order Monotremata found in eastern Australia. It lives in rivers and on river banks. It is one of only two families of mammals which lay eggs.")
The document is automatically processed (tokenized, tagged, parsed...)
list(doc)
for token in doc:
    print(token.text)
    # or simply: print(token)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_sent_start, token.is_stop, sep='\t')
for sentence in doc.sents:
    print(sentence, sentence.root)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
list(doc.noun_chunks)
Spacy can also do visualisations:
from spacy import displacy
displacy.serve(doc, style="dep")
displacy.serve(doc, style="ent")
It seems on Windows, in some cases, Displacy says the visualisation server is running on
http://0.0.0.0:5000
but actually it is on http://localhost:5000
Larger models also contain word embeddings and can do word similarity: https://spacy.io/usage/spacy-101#vectors-similarity
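A minimal similarity sketch, assuming you have downloaded one of the larger English models (e.g. python -m spacy download en_core_web_md):
import spacy

nlp = spacy.load("en_core_web_md")        # a model that includes word vectors
doc = nlp("I like apples and oranges.")
apples, oranges = doc[2], doc[4]
print(apples.similarity(oranges))                 # similarity of the two word vectors
print(doc.similarity(nlp("Fruit is healthy.")))   # document-level similarity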
exercise: write out the processed document one token per line in the format wordform POStag, separated by a tab, with an empty line separating sentences
Installation:
# in terminal
pip3 install --user nltk
ipython3
import nltk
# optionally:
# nltk.download()
# usually, you should choose to download "all" (but it may get stuck)
A very similar tutorial to what we do in the class is available online at Dive Into NLTK; we mostly cover the contents of the parts I, II, III and IV.
Sentence segmentation, word tokenization, part-of-speech tagging, named entity recognition. Use genesis or any other text.
text = """The duck-billed platypus (Ornithorhynchus anatinus) is a small
mammal of the order Monotremata found in eastern Australia. It lives in
rivers and on river banks. It is one of only two families of mammals which
lay eggs."""
# or use e.g. Genesis again
# with open("genesis.txt", "r") as f:
# text = f.read()
sentences = nltk.sent_tokenize(text)
# just the first sentence
tokens_0 = nltk.word_tokenize(sentences[0])
tagged_0 = nltk.pos_tag(tokens_0)
# all sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(tokenized_sentences)
ne=nltk.ne_chunk(tagged_0)
print(ne)
ne.draw()
exercise: write out the tagged text one token per line in the format wordform POStag, separated by a tab, with an empty line separating sentences
Let's create a simple constituency tree for the sentence A red bus stopped suddenly:
# what we want to create:
#
# S
# / \
# NP VP
# / | \ / \
# A red bus stopped suddenly
#
from nltk import Tree
# Tree(root, [children])
np = Tree('NP', ['A', 'red', 'bus'])
vp = Tree('VP', ['stopped', 'suddenly'])
# children can be strings or Trees
s = Tree('S', [np, vp])
# print out the tree
print(s)
# draw the tree (opens a small graphical window)
s.draw()
And a dependency tree for the same sentence:
# what we want to create:
#
# stopped
# / \
# bus suddenly
# / |
# A red
# can either use string leaf nodes:
t1=Tree('stopped', [Tree('bus', ['A', 'red']), 'suddenly'])
t1.draw()
# or represent each leaf node as a Tree without children:
t2=Tree('stopped', [Tree('bus', [ Tree('A', []), Tree('red', []) ]), Tree('suddenly', []) ])
t2.draw()
Note: some of the frameworks/toolkits are in (very) active development; therefore, the information listed here may easily fall out of date.
Dec 6, 2023
A peek into requests library and REST APIs
getting resources from the internet
static resources (using Python instead of wget
)
import requests
url = 'http://p.nikde.eu'
response = requests.get(url)
response.encoding='utf8'
print(response.text)
dynamic resources provided through a REST API; for a given REST API you want to use, look for its documentation on its website
# you need to find out what the URL of the endpoint is
url = 'http://lindat.mff.cuni.cz/services/translation/api/v2/models/en-cs'
# you need to find out what parameters the API expects
data = {"input_text": "I want to go for a beer today."}
# sometimes, you may need to specify some headers (often not necessary)
headers = {"accept": "text/plain"}
# some APIs support `get`, some support `post`, some support both
response = requests.post(url, data = data, headers = headers)
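The returned response can then be inspected as in the earlier example; a tiny sketch, assuming the request above succeeded (for this endpoint the reply should be plain text):
response.raise_for_status()   # raises an exception if the request failed
print(response.text)          # for this endpoint, the translation as plain text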
to tokenize, tag and parse a short English text, you can run curl
directly in the terminal (--data
specifies data to send via the POST
method; to use GET, you would put the parameters directly into the URL):
curl --data 'model=english&tokenizer=&tagger=&parser=&data=Christmas is coming! Are you ready for it?' http://lindat.mff.cuni.cz/services/udpipe/api/process
to print out the result as plaintext, you can pipe it to:
python -c "import sys,json; print(json.load(sys.stdin)['result'])"
to perform only sentence-segmentation and tokenization, use only the
tokenizer=
processor (no tagger and parser), and set
output=horizontal
so you can use REST APIs directly from the terminal; but it is probably more comfortable from Python
use the requests
module, which has a get()
function (as well as a
post()
function); provide the URL of the API, and the parameters (if
any) as a dictionary:
import requests
url = 'http://lindat.mff.cuni.cz/services/udpipe/api/process'
params = dict()
params["model"] = "english"
params["tokenizer"] = ""
params["tagger"] = ""
params["parser"] = ""
params["data"] = "Christmas is coming! Are you ready for it?"
response = requests.get(url, params)
the response contains a lot of fields, the most important being text, which contains the content of the response; often (but not always) it is in JSON, so you might want to load it using json.loads(), but you can also get it directly using .json():
# the "raw" response
print(response.text)
# if the response is in JSON:
print(json.loads(response.text))
# or:
print(response.json())
# if the JSON contains the 'result' field (for UDPipe it does):
print(response.json()['result'])
The requests module makes an educated guess as of the encoding of the response. If it guesses wrong, you can set the encoding manually, e.g.:
response.encoding='utf8'
Some NLP tools with REST APIs available at ÚFAL:
links e.g. to Cat facts :-)
curl 'https://cat-fact.herokuapp.com/facts/random?animal_type=dog'
or to Random cats
import requests
from io import BytesIO
from PIL import Image
rcat = requests.get('https://aws.random.cat/meow')
img_url = rcat.json()['file']
rimg = requests.get(img_url)
img = Image.open(BytesIO(rimg.content))
img.show()
joining multiple things together:
def randomfact(animal='cat'):
    url = 'https://cat-fact.herokuapp.com/facts/random?animal_type=' + animal
    response = requests.get(url)
    j = response.json()
    print(j['text'])
    d = nlp(j['text'])
    for entity in d.ents:
        print(entity, entity.label_)
hw_rest Building REST APIs Questions
Dec 13, 2023
data: WALS export
the source: The World Atlas of Language Structures Online
ideally use notebooks
try reading in the data using pandas
import pandas as pd
languages = pd.read_csv('language.tsv', sep='\t')
# pandas can read csv, tsv, xls, xlsx and other table formats
# pd.read_excel()
# header=None, names=["form", "pos"]
rows and columns
languages[0:5]
languages["Name"]
languages.Name
languages[["iso_code","Name"]]
languages.loc[0:5, ["iso_code","Name"]]
subselection, search
languages.genus == "Slavic"
languages.loc[languages.genus == "Slavic"]
languages.query('genus == "Slavic"')
languages.Name.str.startswith("Arabic")
languages.loc[languages.Name.str.startswith("Arabic")]
languages.query('Name.str.startswith("Arabic")')
add columns
languages.Name.str.split('(')
# expand=True
languages[["langbase", "langspec", "langspec2"]] = languages.Name.str.split('(', expand=True)
frequency lists
languages.genus.value_counts()
languages.genus.value_counts().plot(kind="bar")
languages.genus.value_counts().head().plot(kind="bar")
create
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
misc
Series, DataFrame
head, tail
columns, index
describe
T
sort_values(by="genus")
at[row, column]
loc[rows, columns]
iat[row_index, column_index]
iloc[row_indexes, column_indexes]
languages[languages.genus.isin(["Slavic", "Germanic"])]
languages.genus.str.lower()
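A small sketch exercising a few of the methods listed above on the languages table loaded earlier:
# a few of the misc methods above, applied to the WALS languages table
print(languages.head())                       # first rows
print(languages.describe())                   # summary statistics of numeric columns
print(languages.sort_values(by="genus"))      # sort rows by a column
print(languages.loc[0:5, ["Name", "genus"]])  # label-based selection
print(languages.iloc[0:5, 0:3])               # position-based selection
print(languages.at[0, "Name"])                # a single value by label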
also exists: csv
import csv
with open("language.tsv", encoding="utf-8") as infile:
reader = csv.reader(infile, dialect="excel-tab")
for row in reader:
print(row)
Dec 20, 2023
Jan 3, 2024
Jan 10, 2024 Questions
hw_ssh
as described in the assignment
Duration: 10-30min 100 points Deadline: Oct 12 23:59, 2024
In this homework, you will practice working through SSH.
Connect remotely from your home computer to the MS lab
on Linux, use ssh in the terminal
on Windows, use ssh in the commandline -- open the command prompt (Windows+R), type cmd, press Enter, and you are in the commandline
Check that you can see there the data from the class
Try practising some of the commands from the class: try renaming files, copying files, changing file permissions, etc.
Try to create a shell script called hello.sh
that prints some text, make it executable, and run it, e.g.:
echo 'echo Hello World' > hello.sh
chmod u+x hello.sh
./hello.sh
List your "friends": write a script called friends.sh that lists all users which have the same first character of their username as you do (hint: see e.g. /afs/ms/u/r/).
Put your scripts into a shared directory: in /afs/ms/doc/vyuka/INCOMING/TechnoNLP/, create a directory called hw_ssh and copy your scripts into the hw_ssh directory.
You can also try connecting to the MS lab from your smartphone and running a few commands -- this will let you experience the power of being able to work remotely in Bash from anywhere...
You should be absolutely confident in doing these tasks. If you are not, take some more time to practice.
And, as always, contact us per e-mail if you run into any problems!
Duration: 10min-1h 98 points Deadline: Oct 19 23:59, 2024
Go again through the instructions for using Git and GitLab, and make sure everything works both on the lab computers (connect through SSH to check this) and on your home computer. Then proceed with the following "toy" homework assignment:
On your home computer, clone your repository from the remote repository (i.e. GitLab) and go into it.
In your Git repository,
create a directory called hw_git and add it into Git (git add hw_git).
In this directory, create a text file that contains at least 10 lines of text, e.g. copied from a news website. (and add it into Git).
Commit the changes locally (e.g. git commit -m'adding text file').
Create a new Bash script called sample.sh in the directory.
When you run the Bash script (./sample.sh), it should write out the first 5 lines from the text file.
Commit the changes locally.
Push the changes to the remote repository (i.e. GitLab).
Connect to a lab computer through SSH, clone the repository from the remote repository (i.e. GitLab), try to find your script and run it to see that everything works fine. (If it does not, fix it.)
Still through SSH, change the script to only print first 2 lines from the file.
Commit and push the changes. (Even though the script file is already part of
the Git repository, i.e. it is "versioned", the new changes are not, so you still need to either add the
current version of the script again (git add sample.sh
),
or use commit
with the -a
switch which automatically adds all changes
to versioned files.)
Go back to your local clone of the repository on your home computer, pull the changes, and check that everything works correctly, i.e. that the script prints the first 2 lines from the file. (If it does not, fix it.)
In the local clone, change the script once more, so that it now prints the last 5 lines from the text file. Commit and push.
Go again into the repository clone stored in the lab, pull the changes, and check that the script works correctly. (If it does not, fix it.)
Copy your solutions for hw_ssh
into the Git repository.
Again, make sure to add them, commit them, push them, and check that they
work.
If you run into problems which you are unable to solve, ask for help!
You will submit some of the following assignments in this way, i.e. through Git, in a directory named identically to the assignment. Once you finish an assignment and submit it through git, always use SSH to connect to the lab, pull the assignment, and check that it works correctly.
Duration: 10min 2 points Deadline: Oct 19 23:59, 2024
Do the hw_hello
assignment in Recodex!
Duration: 1-6h 100 points Deadline: Oct 31 23:59, 2024
Do the hw_makefile
assignment in Recodex!
Duration: 1-4h 100 points Deadline: Nov 10 23:59, 2024
Do the hw_python
assignment in Recodex!
Duration: 2-6h 100 points Deadline: Nov 24 23:58, 2024
Do the hw_string
assignment in Recodex!
Duration: 1-4h 100 points Deadline: Nov 29 23:58, 2024
Do the hw_tagger
OR hw_tagger_simple
assignment in Recodex!
hw_tagger
hw_tagger_simple
so you might also check that if you are unsure about some aspects of the assignment (but beware that the interface is different there).
If hw_tagger is too difficult for you, do hw_tagger_simple.
Duration: 2-6h 100 points Deadline: Dec 1 23:59,
Implement conversions between TSV, XML and JSON.
Commit this assignment into your Git repo.
sample.xml
sample.json
sample.tsv
Makefile with targets download, tsv2xml, xml2json, json2tsv, and check for the individual steps, and a target all that runs them all
Duration: 1-6h 100 points Deadline: Dec 8 23:59,
Train a model for an NLP framework.
Commit this assignment into your Git repo.
train, use and evaluate an NLP model in Spacy or NLTK, for Czech or another language
Makefile with targets readme, download, train, eval, show:
readme prints out a short text saying what you did and how it went
download downloads the linguistic data needed for training and testing the model
train trains the model and stores it into a file or files
eval evaluates the trained model
show applies the model to a few sample sentences
you can use head to cut off just a part of the treebank so that the training does not take ages...
Spacy:
we suggest training a model containing a part-of-speech tagger and a syntactic parser
see the Spacy manual for details: https://spacy.io/usage/training (and/or use Google); the following guidelines worked at some point but tend to get out of date with new versions of Spacy
note that in Spacy, Tagger only handles the tag, so you also need Morphologizer for pos and Lemmatizer for lemma
convert your data to Spacy JSON format
e.g. converting a file train.conllu and storing the converted data into the data directory (the directory has to exist)
python3 -m spacy convert train.conllu data
use the online Spacy tool to create a base config file base_config.cfg
use spacy to create a full config file from the base config which you download from the online tool:
python -m spacy init fill-config base_config.cfg config.cfg
train a Spacy model
e.g. using the train.spacy and dev.spacy data files to train a model and save it into the models directory
the training keeps going over the train data to train a model
you can probably train for a much smaller number of epochs than the default, especially if you are training on the small dataset anyway
it also prints out some progress, repeatedly evaluating the model on the dev data, so you can observe how the tagging accuracy (Tag %) and syntactic parsing accuracy (UAS and LAS) keeps improving
python -m spacy train config.cfg --output ./models --paths.train data/train.spacy --paths.dev data/dev.spacy
try loading and using the model:
import spacy
nlp = spacy.load("models/model-best")
doc = nlp("some text")
...
evaluate the model on test data (POS is part of speech, UAS and LAS are syntactic parsing accuracies):
python3 -m spacy evaluate models/model-best data/test.spacy
NLTK:
we suggest training a part-of-speech tagger
note that you have to convert the input data appropriately into a format which is expected by the tagger
to see what format the tagger expects, see e.g.:
from nltk.corpus import treebank
print(treebank.tagged_sents()[:3])
the corpus is a list of sentences
each sentence is a list of tokens
each token is a pair of word and tag
train_data = [
[ ('Čtvrť', 'N'), ('pro', 'R'), ('diplomaty', 'N') ],
[ ('Výstavbu', 'N'), ('diplomatické', 'A'), ('čtvrti', 'N'), ('v', 'R'), ('hlavním', 'A'), ('městě', 'N'),... ]
]
use any of the trainable taggers available in NLTK, e.g. TnT:
from nltk.tag import tnt
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)
try out the model, e.g.:
tnt_pos_tagger.tag(nltk.word_tokenize("Dal bych si jedno pivo."))
evaluate the model, e.g.:
tnt_pos_tagger.evaluate(test_data)
if you want, you can experiment with multiple taggers and multiple settings and improvements to achieve a good accuracy
by default, the model returns something like Unk for unknown words, which is not very clever; so you might want to complement it with a DefaultTagger and set it to return a noun label for unknown words
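A minimal sketch of that idea, assuming train_data is prepared as above (the tag N matches the simplified tagset used in this course):
from nltk.tag import tnt
from nltk.tag import DefaultTagger

# back off to a DefaultTagger that labels unknown words as nouns ('N')
unk_tagger = DefaultTagger('N')
tnt_pos_tagger = tnt.TnT(unk=unk_tagger, Trained=True)
tnt_pos_tagger.train(train_data)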
to store and load the tagger, use e.g. pickle:
import pickle
with open('tnt_treebank_pos_tagger.pickle', 'wb') as f:
    pickle.dump(tnt_pos_tagger, f)
with open('tnt_treebank_pos_tagger.pickle', 'rb') as f:
    loaded_tagger = pickle.load(f)
Duration: 1-2h 100 points Deadline: Dec 15 23:59,
Build a script that uses some REST APIs.
include a readme target explaining what your submission does and how to use it
make readme should print out sufficient information for the user to use your submission
don't forget the readme target!
Sample questions for the final written test. The test is not limited to the following list. However, all the test questions will come from the below illustrated areas.
Some questions require you to write some code. As the test is computer-less, just pen and paper, you will be neither allowed nor required to run and debug the code on a computer. For this reason, we will not severely penalize minor errors in the code; we will understand the code as the first version you write before running it and debugging the various small errors.
Name and describe at least two options for each of the following commands in bash: ls, sort, cut, iconv, grep (1 point).
Give examples of what the .bashrc
file can be used for (1 point).
Explain how command line pipelining works (1 point).
Create a bash script that counts the total number of words in all *txt files in all subdirectories of the current directory (2 points).
You created a new file called doit.sh
and wrote some Bash commands into it,
e.g.:
echo "ls -t | head -n 5 | cat -n" > doit.sh
How do you run it now? (1 point)
What do you think the following command does?
ls -t | head -n 5 | cat -n
How would you check what it really does (without running it)? (1 point)
Explain the notions "character set" and "character encoding" (1 point).
Explain the main properties of ASCII (1 point).
What 8-bit encoding do you know for Czech or other European languages (or your native language)? Name at least three. How do they differ from ASCII? (1 point)
What is Unicode and what Unicode encodings do you know? (1 point)
Explain the relation between UTF-8 and ASCII. (1 point)
How can you detect the encoding of a file? (1 point)
You have three files containing identical Czech text. One of them is encoded using the ISO charset, one of them uses UTF-8, and one uses UTF-16. How can you tell which is which? (1 point)
How would you proceed if you are supposed to read a file encoded in ISO-8859-1, add a line number to each line and store it in UTF8? (a source code snippet in your favourite programming language is expected here) (2 points)
Name three Unicode encodings (1 point).
Explain the size difference between a file containing a text in Czech (or in your native language) stored in an 8-bit encoding and the same file stored in UTF-8. (1 point)
How do you convert a file from one encoding to another, for instance from a non-UTF-8 encoding to UTF-8? (1 point)
Write a Python script that reads a text content from STDIN encoded in ISO-8859-2 and prints it to STDOUT in utf8. (2 points)
Explain what BOM is (in the context of file encodings). (1 point)
What must be done if you have a CP1250-encoded HTML web page and you want to turn it into a UTF-8-encoded page? (1 point)
How are line ends encoded in plain text files? (1 point)
What would be the minimum and maximum expected size (in bytes) of a textual file that contains a 5-letter Czech word. Explain all reasons of this file size variability. (2 points)
How could you explain the situation in which you have a UTF8-encoded plain text file that contains two words which look exactly the same, but they are not equal as strings (and have different byte representations when viewed using hexdump too)? (1 point)
How can you distinguish a file containing the Latin letter "A" from a file containing the Cyrilic letter "A" or the Greek letter "A"? (1 point)
Align screenshot pictures A-F with file encoding vs. view encoding situations I-IV. (2 points)
A.
B.
C.
D.
E.
F.
I. used file encoding: UTF-8 + used view encoding: UTF-8
II. used file encoding: UTF-8 + used view encoding: some 8-bit encoding
III. used file encoding: some 8-bit encoding + used view encoding: some other 8-bit encoding
IV. used file encoding: some 8-bit encoding + used view encoding: UTF-8
Using the Bash command line, get all lines from a file that contain one or two digits, followed by a dot or a space. (1 point)
Using the Bash command line, remove all punctuation from a given file. (1 point)
Using the Bash command line, split text from a given file into words, so that there is one word on each line. (1 point)
Using the Bash command line, download a webpage from a given URL and print the frequency list of opening HTML tags contained in the page. (2 points)
Using the Bash command line, print out the first 5 lines of each file (in the current directory) whose name starts with "abc". (2 points)
Using the Bash command line, find the most frequent word in a text file. (2 points)
Assume you have some linguistically analyzed text in a tab-separated file (TSV). You are just interested in the word form, which is in the second column, and the part-of-speech tag, which is in the fourth column. How do you extract only this information from the file using the Bash command line? (2 points)
Create a Makefile with three targets. The "download" target downloads the
webpage nic.nikde.eu
into a file, the "show" target prints out the file, and the "clean" target
deletes the file. (2 points)
Create a Makefile with two targets. When the first target is called, a web page is downloaded from a given URL. When the second target is called, the number of HTML paragraphs (<p> elements) contained in the file is printed. (2 points)
Suppose there is a plain-text file containing an English text. Write a Bash pipeline of commands which prints the frequency list of 50 most frequent tokens contained in the text. (Simplification: it is sufficient to use only whitespace characters as token separators) (2 points).
Assume you have some linguistic data in a text file. However, some lines are comments (these lines start with a "#" sign) and some lines are empty, and you are not interested in those. How do you get only the non-empty non-comment lines using the Bash command line? (2 points)
Assume you have some linguistically analyzed text in a comma-separated file (CSV). The first column is the token index — for regular tokens, this is simply a natural number (e.g. 1 or 128), for multiword tokens this is a number range (e.g. 5-8), and for empty tokens it is a decimal number (e.g. 6.1). How do you get only the lines that contain a regular token? (2 points)
Explain the following bash code:
grep . table.txt | rev | cut -f2,3 | rev
(1 point)
Create a bash script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT, one sentence per line (simplification: let's suppose that sentences can be ended only by fullstops and questionmarks). (2 points)
Write a bash script that returns a word-bigram frequency "table" (in the tab-separated format) for its input (2 points).
Write a Bash script that returns a letter-bigram frequency "table" (in the tab-separated format) for its input (2 points).
Name 4 Git commands and briefly explain what each of them does (a few words or a short sentence for each command) (1 point).
Assume you already are in a local clone of a remote Git repository. Create a new file called "a.txt" with the text "This is a file.", and do everything that is necessary so that the file gets into the remote repository (2 points).
Name two advantages of versioning your source codes (with Git) versus not versioning it (e.g. just having it in a directory on your laptop) (1 point).
You and your colleague are working together on a project versioned with Git.
Line 27 of script.py
is empty. You change that line to initialize a variable
("a = 10"), while you colleague changes it to modify another variable ("b +=
20"). He is faster than you, so he commits and pushes first. What happens
now? Can you push? Can you commit? What do you need to do now? (2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
echo aaa > a; git add a; git push; git commit -m'creating a'
(2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
echo aaa > a; git commit -m'creating a'; git push
(2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
echo aaa > a; git add a; git push
(2 points)
Given a string stored in a variable text, how do you get the following: first character, last character, first 3 characters, last 4 characters, 3rd to 5th character? (2 points)
wc: write a script that reads in the contents of a file, and prints out the number of characters, whitespace characters, words, lines and empty lines in the file. (2 points)
Assume genesis_text contains a text, with punctuation removed, i.e. there are just words separated by spaces. Print out the most frequent word. (2 points)
Name 5 string methods and explain what they do. (1 point)
Write a piece of code that prints out all numbers in a text (tokens that consist only of digits 0-9) joined by underscores (e.g. "L33t Peter has 5 apples, 123 oranges, an iPhone7 and 6466868 pears." becomes "5_123_6466868") (1 point)
Write a piece of code that replaces all occurrences of "Python" by "vicious snake". (1 point)
Write a piece of code that decides whether a string looks like a name — one word consisting of an uppercase letter followed by lowercase letters. (1 point)
Write a piece of code that converts all dates in text from the format "nth/nd/rd Month" to "Month n", so e.g. "I was born on 29th January and my sister on 3rd February" becomes "I was born on January 29 and my sister on February 3" (1 point)
Write a piece of code that replaces all words that start with "pwd" by *****. (1 point)
Write a piece of code that converts the "'s" possessive to the "of" possessive, so that e.g. "I like Peter's car the most." becomes "I like car of Peter the most." (1 point)
Write a piece of code that takes a text in which some lines start with an asterisk and a space ("* ") and replaces the asterisks with consecutive ordinal numbers followed by a dot, starting with 1; e.g.:
Do not forget to buy:
* cheese
* wine
(just a cheap one)
* some bread
becomes:
Do not forget to buy:
1. cheese
2. wine
(just a cheap one)
3. some bread
(2 points)
Write a Python script that reads an English text from STDIN and prints the same text with 'highlighted' personal pronouns (e.g. by placing them between two asterisks *) (2 points).
Write a Python script that returns a word-bigram frequency table for its input. A text is expected on STDIN and a two column table is expected to be printed on STDOUT (2 points).
Write a Python script that returns a letter-bigram frequency table for its input (2 points).
Suppose you have a file containing a list of first names, one per line. Process another file containing an English text with Python, so that all personal names are shortened just to the initial letter and a dot, if a surname follows the first name. ("John Smith called me yesterday" → "J. Smith called me yesterday") (2 points)
Write a Python script that removes all leading and trailing whitespace from each input line, and replaces all the remaining sequences of whitespace characters with just one space. (2 points)
Create a Python script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT (simplification: let's suppose that sentences can be ended only by fullstops and questionmarks). (2 points)
Create a class representing a word form and its lemma, having a constructor and a method for writing out the word form and the lemma. (1 point)
Create a class with a non-empty constructor and create an instance of this class. (1 point)
Create a very simple Python object-oriented tree representation: create a class Node which has attribute children which keeps the list of the node's children, and attribute lemma (base form). There should be a method nodeA.add_child(lemma) which creates a new node (a child of nodeA) labelled with the given lemma. You can disregard any absolute and relative ordering of nodes (2 points).
What is the difference between a class and an object?
Name at least two advantages of using classes and objects (as compared to not using them). (1 point)
Write an example of the "name main" block. What does it do? (1 point)
You want to define a class in one Python file and then use it in another Python file. How do you do that? Explain this using examples of code you would write into these two files. (2 points)
What is XML? (1 point)
Explain the XML terms 'tag', 'attribute', and 'element'? (1 point)
What is a well-formed XML file? (1 point)
What is a valid XML file? (1 point)
What is the difference between XML well-formedness and XML validity? (1 point)
How can you check an XML file's well-formedness? (1 point)
Give an example of a correct HTML code fragment that does not conform to the XML rules? How can you make it XML-well-formed? (1 point)
Perform the minimum correction of the following XML fragment so that it becomes well formed:
<contact> --- Green&Son's email address is <grson@xmail.com> --- </contact>
(1 point)
Modify the following code so that it prints not only tags and attributes of elements directly embedded in the root element, but tags and attributes of all elements in the XML file (i.e., including the root and all deeper elements).
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='example.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
(2 points)
Create a Python script that reads a simple frequency list from STDIN (tab separated word form and frequency on each line) and turns it into a simple XML formatted file printed to STDOUT (2 points).
Let's assume that you need to store a sentence representation in which for each token the original word form as well as its lemma (base form) and part of speech are stored. Could you give an example of JSON code that could be used for such a structure exemplified on a sentence with three words? (2 points)
Describe how the basic JSON data types could be mapped to Python types. (2 points)
Give examples of advantages (at least three) and disadvantages (at least two) of JSON compared to XML. (2 points)
Show how a phone-number book (a list of tuples name-surname-phonenumber) could be serialized using XML and using JSON. (2 points)
What are some advantages of using an existing NLP framework over writing all the codes yourself? (1 point)
What are some disadvantages of using an existing NLP framework over writing all the codes yourself? (1 point)
Name at least 4 things Spacy or NLTK can do (1 point).
Given a list of tokens, write code that POS-tags the tokens, using Spacy or NLTK (2 points).
Write a script that reads in English text which has one sentence per line and prints out POS tags for the words (one sentence per line, POS tags separated by spaces), using Spacy or NLTK (2 points).
Write code using Spacy or NLTK that takes English text and prints out the POS tag of the sentence-initial words (i.e. for each sentence, only print out the tag of its first word). (2 points)
Given a list of tokens, POS-tag them with Spacy or NLTK and print out a frequency list of the tags (2 points).
Name at least 2 NLP frameworks or framework-like tools, say something about them in 1-2 lines (at least what they are good for) (1 point).
What are some advantages and disadvantages of using a RESTful service versus using a Python module to do the same task? (1 point)
What do you need to know about a RESTful service to be able to use it? (1 point)
Let's assume there is a RESTful service at http://example.com/weather that tells you the current weather in the city you specify via a parameter called "city". Use it to find out what the weather is in your hometown. (You can assume it supports both GET and POST methods, and that the response is in plain text.) (2 points)
Let's assume there is a RESTful service at http://example.com/joke that gives you a random joke, and that there is another RESTful service at http://example.com/postag that performs part-of-speech tagging of the text you send to it through its text
parameter.
Write code that gets a random joke and then gets the POS tags for it. (2 points)
How would you design a REST API for a part-of-speech tagger? No code, just what the request and response format would be. (1 point)
Suppose there are three files, a, b, and c. One of them contains text in English, the other two contain texts in other languages. Try to automatically detect which is the English one (i.e. "I look into the files with my eyes." is not a valid solution because this is not automatic) (2 points).
Assume that Rudolf simply runs the code you submit for homework on his computer without looking into the code. Why is that a bad idea? What could happen? Show why this is a bad idea by inventing a short part of code you could have submitted as homework. (2 points)
Assume you have a text file with one sentence on each line. Print only sentences that have exactly four words (2 points).
In NLP, we often lowercase all data, so that e.g. "And" (e.g. at the start of a sentence) and "and" (inside a sentence) are treated the same way. Why might this not be the best idea? What problems could we have because of that? What could be a better approach? (Don't write code, just explain this briefly with your own words.) (1 point)
Your grade is based on the average of your performance; the test and the homework assignments are weighted 1:1.
For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.