Lab sessions
Every other Thursday (more or less), 9.00 a.m. in SW1
Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422
Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409
(Archive of the practical classes from 2020)
Outline
- various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
- mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
- mining information from all layers of the PDT/PEDT (btred, Perl)
- mining information from UD data (btred, Perl)
- searching in treebanks with PML-Tree Query (PML-TQ)
Homeworks
Results of the homeworks (click here)
This year, all homeworks should be submitted by e-mail to mirovsky at ufal.mff.cuni.cz.
Classes 05 and 06 - May 9th and 23rd, 2024
The last two practical classes will also take the form of individual study and homework. The topic will be "searching in PDT data using PML-TQ" and there will be one last homework (details soon).
PML-TQ (Prague Markup Language - Tree Query) is a client-server system for querying any treebank encoded in PML, i.e. any treebank that can be opened/edited in TrEd. PML-TQ queries can be processed either in TrEd (there is an extension for that), or on a server. If on a server, we can either use TrEd as a client (again, with the extension), or there is a web browser interface. For simplicity, we will use the web browser client.
Please follow the tutorial for "PML-TQ in the web browser client". The example queries in the tutorial can be run directly on the data of the Prague Dependency Treebank on the Lindat server hosted at ÚFAL by clicking the "try the query" buttons next to the examples. Some links in the tutorial may not work; e.g., the link to the list of available treebanks seems to be broken. Please access the PDT 3.0 server directly via this link.
Homework 05: after studying the tutorial, solve the homework (due on May 28).
Class 04 - April 25th, 2024
Still in the mode of individual study and homework solution, we will continue our work with btred, this time with Universal Dependencies.
TrEd, btred and Universal Dependencies
TrEd can be used for Universal Dependencies (UD) files, provided you have installed the extension for UD, "TrEd Extension to Work with Universal Dependencies" (TrEd -> Setup -> Manage Extensions -> Get New Extensions). Then you can simply open a *.conllu file.
Try downloading the data for the class, a single file from one of the English UD corpora, or similarly sized (in number of words) Czech data, a part of the Prague Dependency Treebank in UD, and open the data in TrEd.
With btred, the situation is more complex. btred (unlike TrEd; this may be a btred bug) does not see libraries installed along with extensions. Therefore, to process UD files in btred, use:
Once in each bash session:
PERL5LIB+=:~/.tred.d/extensions/ud/libs/
(or use whichever path your extensions are installed in; on some systems you may need to avoid '~' and use the full path instead; remember that on MS Windows it may be C:\Users\[user_name]\AppData\Roaming\.tred.d),
or (better) add the above-mentioned path to PERL5LIB in ~/.bashrc (or similarly, depending on your system and setup).
Then each time you run btred, use the switch -B Treex::PML::Backend::UD, e.g.:
btred -B Treex::PML::Backend::UD -TNe 'writeln($this->attr("lemma"))' *.conllu
I suggest e.g. setting an alias in .bashrc:
alias udbtred='btred -B Treex::PML::Backend::UD'
Homework 04: After setting up btred for Universal Dependencies (see above), read the instructions for the homework carefully (due on May 7th).
Class 03 - April 11th, 2024
Unfortunately, I am still ill, so the lesson will once again be in the form of individual study and homework solution. The goal today is to learn how, in btred, to cross from the tectogrammatical layer to the analytical layer, and how to use hooks.
Homework 03: Please finish steps 6 and 7 of the btred tutorial, and then read the instructions for the homework carefully (due on April 23rd).
Class 02 - March 28th, 2024
Because of my continuing illness, the lesson will be in the form of individual study and homework solution. The goal of the individual study and the homework is to learn to use btred, a scripting tool for working with TrEd data.
Homework 02: After you finish the tutorial to btred (see below), read carefully the instructions for the homework (due on April 9th).
Individual study
Our task is to learn to use btred, a scripting tool for working with TrEd data.
Let me start with a few clarifying statements, which you may already know:
- Tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
- btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
- TrEd and btred are written in Perl, so scripts for btred also need to be written in Perl. However, you only need basic knowledge of Perl.
Your task
Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-5 and 7 of the tutorial in this class and step 6 in the next class. For the homework, knowledge from steps 1-5 and 7 should suffice.
For the tutorial, you can use any PDT-like data, e.g. these PDT data containing annotation of texts up to the analytical layer (a-layer). For examples that use the tectogrammatical layer (t-layer), use these data from the PDT, which also contain the t-files.
As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:
- #!btred -e function_to_run()
  The function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by my @roots = GetTrees().
- #!btred -T -e function_to_run()
  The function function_to_run() will be run on each tree in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by my @nodes = GetNodes($root), or (excl. the root) by my @nodes = $root->descendants.
- #!btred -TN -e function_to_run()
  The function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.
A simple script for counting all nodes in each given file and printing the number next to each file name might look like this:
#!btred -e count_nodes()

sub count_nodes {
    my $total_count = 0;
    my @roots = GetTrees();                      # get an array of roots of all trees in the file
    foreach my $root (@roots) {
        my @nodes = GetNodes($root);             # get an array of all nodes in the tree (incl. the root)
        my $number_of_nodes = scalar(@nodes);    # get the length of the array
        $total_count += $number_of_nodes;
    }
    my $filename = FileName();
    print "$filename: $total_count\n";
}
If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:
btred -I count.btred *.a.gz
I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:
use strict;                 # it informs e.g. about non-declared variables (often typos)
use warnings;               # it warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;                   # it allows utf8 in the script source code
binmode STDIN, ':utf8';     # setting utf8 for STDIN
binmode STDOUT, ':utf8';    # ditto for STDOUT
binmode STDERR, ':utf8';    # ditto for STDERR
Manuals and documentation
For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:
- TrEd/btred user manual, namely its section 15 - User Macros, and most importantly its subsections 15.8. Public API: pre-defined macros (functions GetNodes, GetTrees, ListV, FileName, etc.) and 15.9. Hooks: automatically executed macros (function exit_hook, which is executed once after all input files are processed),
- documentation for Treex::PML - the fundamental libraries used by the TrEd toolkit, first of all documentation for Treex::PML::Node (object methods such as parent, firstson, level, attr, set_attr, children, descendants, ancestors, etc.),
- documentation for the PDT extension (treebank-specific functions such as PML_A::GetSentenceString, PML_A::GetEParents, etc.)
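To illustrate how the three sources fit together, here is a minimal sketch of a btred script that combines a Treex::PML::Node method (descendants, attr), a per-tree function, and the exit_hook hook. It counts how often each analytical function occurs in all given a-files and prints the totals once at the end. The function name count_afuns, the hash %count and the label 'undefined' are my own choices for this illustration, and the script assumes PDT-like a-layer data where nodes carry the attribute afun; it needs btred to run.

```perl
#!btred -T -e count_afuns()
use strict;
use warnings;

my %count;    # afun => number of nodes, accumulated over all input files

# Run once for each tree (because of -T); $root is the root of the current tree.
sub count_afuns {
    foreach my $node ($root->descendants) {            # all nodes except the root
        my $afun = $node->attr('afun') // 'undefined'; # label chosen for this sketch
        $count{$afun}++;
    }
}

# Run once after all input files have been processed.
sub exit_hook {
    foreach my $afun (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        print "$afun\t$count{$afun}\n";
    }
}
```

If the script were saved as, say, afuns.btred (a hypothetical file name), it could be run in the same way as the counting script above: btred -I afuns.btred *.a.gz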
Exercise after 5 steps of the tutorial
After you finish the first five steps of the tutorial, you may practice the acquired knowledge, if you want, on the following tasks (you may not need to actually write the scripts; just thinking the tasks through may suffice):
- Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
- Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
- Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).
Exercise after 7 steps of the tutorial
- Count the distribution of morphological tags (use only the first five positions of the tag) for a-nodes with afun 'Pred', and similarly for a-nodes with afun 'Sb'.
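For the tag-based count, the only Perl needed beyond the tutorial is substr and a hash. The following is a minimal standalone sketch in plain Perl (no btred involved) of counting tag prefixes. The function name prefix_distribution is hypothetical, and the tags in @tags are illustrative values written in the 15-position PDT style; in a btred script they would instead come from $node->attr('m/tag').

```perl
use strict;
use warnings;

# Count how often each tag prefix (the first $length positions) occurs.
sub prefix_distribution {
    my ($length, @tags) = @_;
    my %count;
    $count{ substr($_, 0, $length) }++ for @tags;
    return %count;
}

# Illustrative tags only; they stand in for values read from the treebank.
my @tags = ('VB-S---3P-AA---', 'VB-P---3P-AA---', 'VpYS---XR-AA---', 'VB-S---3P-NA---');

my %dist = prefix_distribution(5, @tags);
print "$_\t$dist{$_}\n" for sort keys %dist;
```

With the four sample tags, the script prints VB-P- with count 1, VB-S- with count 2 and VpYS- with count 1, one per line.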
As another example, below is a script for printing out verb-less sentences from the analytical layer:
#!btred -T -e function_to_run()
use strict;

sub function_to_run {
    my $found = 0;                                   # remember if a verb has been found
    my @nodes = GetNodes($root);                     # get all nodes in the tree
    foreach my $node (@nodes) {                      # and process them one by one
        my $tag = $node->attr('m/tag');              # get the morphological tag (may be undefined, e.g. for the root)
        if (defined $tag and $tag =~ /^V/) {         # check if it is a verb
            $found = 1;                              # if it is, remember that a verb has been found
            last;                                    # and finish the cycle
        }
    }
    if ($found == 0) {                               # if a verb was not found in the sentence
        my $id = $root->attr('id');                  # get the id of the tree root
        my $sent = PML_A::GetSentenceString($root);  # get the sentence
        print "$id: $sent\n";                        # and print it out
    }
}
Class 01 - March 14th, 2024
Due to illness, the lesson will take the form of individual work on the homework. Please read Homework 01 carefully. Send the results to me by e-mail (mirovsky@ufal.mff.cuni.cz) by March 24th. Also, ask me by e-mail if you encounter difficulties or if something in the instructions is unclear.
Class 00 - February 22nd, 2024
The first task for today: install the tree editor TrEd, on the computers in the lab or on your personal computers, from the TrEd home page.
On MS Windows, use the installation package containing also the Strawberry Perl distribution.
On Linux, follow these instructions:
- Setting up cpan (so that it uses local directories; http://www.perlmonks.org/?node_id=630026):
  - mkdir -p ~/.cpan/CPAN
  - echo "1" >~/.cpan/CPAN/MyConfig.pm
  - perl -MCPAN -e shell
  - cpan> o conf init # (use local::lib and at the end, allow setting (or set manually) some variables in .bashrc file)
  - exit cpan, exit and start bash
- Installing cpanm for easier installation of other Perl modules
  - cpan App::cpanminus
- Installing TrEd
  - wget http://ufal.mff.cuni.cz/tred/tred-current.tar.gz
  - tar -zxvf tred-current.tar.gz
  - cd tred
  - ./tred
  - install missing libraries as reported and repeat
The second task for today: Test the installation, set up TrEd (extensions, fonts)
After we have installed TrEd, let us try it - download the following data:
lindat.cz - go to "Repository" and search for Prague Dependency Treebank 2.0 - sample data
After you unzip the data, try to open one of the .t.gz files. You should get an error message complaining about missing schemas. It is because you also need to install a TrEd extension for the given type of data:
In TrEd, go to Setup -> Manage Extensions -> Get New Extensions and search for pdt20. Check it and press "Install Selected". Now close and start TrEd again. It should be able to open the data now.
You can customize TrEd in the configuration file .tredrc
- customize fonts: see the font section in the TrEd documentation
TrEd can handle all types of treebanks - try, e.g., an example from the Penn Treebank:
An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...
English data to test the TrEd installation: section 000 of PEDT
- Download and unzip the data. It is section 000 of the Penn Treebank transformed to the format of the Prague treebank family; in this particular case, one document is represented by three files corresponding to surface syntax - analytical layer (a-files), deep syntax - tectogrammatical layer (t-files), and original phrase structure layer (p-files). Then try to open one of the t-files in TrEd (you will need to install the pedt extension); you can also open an a-file and a p-file.
The third task for today: Transform phrase-structure trees to dependency ones
Sample phrase structure tree (file)
S ( NP ( N ( 'Peter' ) ) * VP ( * V ( 'gave' ) NP ( D ( 'a' ) * N ( 'flower' ) ) PP ( * P ( 'to' ) N ( 'Mary' ) ) ) )
Another sample phrase structure tree (file)
S ( NP ( A ('Young') * N ('men')) * VP (* V ( 'love' ) COORD (NP ( N('beer')) * CONJ ('and') NP ( N( 'girls' ) ) )) )
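One possible way to approach the transformation is sketched below in plain Perl (no btred needed). It rests on assumptions that the sample files do not state explicitly: that an asterisk marks the head child of a constituent, that a constituent with a single child needs no asterisk, and that terminals have the form LABEL ( 'word' ). The function names tokenize, parse_node and to_dependencies are my own. Each word becomes a dependent of the lexical head of the constituent it belongs to, and the head of the whole sentence gets parent 0.

```perl
use strict;
use warnings;

# Split the bracketed string into tokens: '(', ')', '*', quoted words, labels.
sub tokenize {
    my ($text) = @_;
    return $text =~ /\(|\)|\*|'[^']*'|[^\s()*']+/g;
}

# Parse one constituent from the front of @$tokens. Words are appended to
# @$words, the governor of each dependent word is stored in %$parent_of, and
# the 1-based position of the constituent's lexical head is returned.
sub parse_node {
    my ($tokens, $words, $parent_of) = @_;
    my $label = shift @$tokens;
    die "unexpected end of input\n" unless defined $label;
    die "expected '(' after $label\n" unless (shift(@$tokens) // '') eq '(';

    # Terminal: LABEL ( 'word' )
    if (@$tokens && $tokens->[0] =~ /^'(.*)'$/) {
        push @$words, $1;
        shift @$tokens;
        die "expected ')' after a word\n" unless (shift(@$tokens) // '') eq ')';
        return scalar @$words;
    }

    # Non-terminal: parse all children and remember which one is starred.
    my (@child_heads, $head_pos);
    while (@$tokens && $tokens->[0] ne ')') {
        if ($tokens->[0] eq '*') {
            shift @$tokens;
            $head_pos = scalar @child_heads;
        }
        push @child_heads, parse_node($tokens, $words, $parent_of);
    }
    die "expected ')' closing $label\n" unless (shift(@$tokens) // '') eq ')';
    die "empty constituent $label\n" unless @child_heads;
    $head_pos //= 0 if @child_heads == 1;    # an only child is the head
    die "no head marked in $label\n" unless defined $head_pos;

    # Heads of the non-head children depend on the head of the head child.
    my $head = $child_heads[$head_pos];
    for my $i (0 .. $#child_heads) {
        $parent_of->{ $child_heads[$i] } = $head unless $i == $head_pos;
    }
    return $head;
}

# Return a list of [position, word, parent position]; the root has parent 0.
sub to_dependencies {
    my ($text) = @_;
    my @tokens = tokenize($text);
    my (@words, %parent_of);
    my $root = parse_node(\@tokens, \@words, \%parent_of);
    die "unexpected tokens after the tree\n" if @tokens;
    $parent_of{$root} = 0;
    return map { [ $_, $words[$_ - 1], $parent_of{$_} ] } 1 .. scalar(@words);
}

my $tree = "S ( NP ( N ( 'Peter' ) ) * VP ( * V ( 'gave' ) NP ( D ( 'a' ) * N ( 'flower' ) ) PP ( * P ( 'to' ) N ( 'Mary' ) ) ) )";
print join("\t", @$_), "\n" for to_dependencies($tree);
```

For the first sample tree, this prints one line per word with its position, form and parent position: Peter depends on gave (2), gave is the root (0), a depends on flower (4), flower and to depend on gave (2), and Mary depends on to (5). Keeping only parent positions is a deliberate simplification; the category labels are discarded, so a fuller solution might also carry them over as dependency labels.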