Lab sessions

Every other Thursday (more or less), 9.00 a.m. in SW1

Jiří Mírovský
mirovsky at ufal.mff.cuni.cz
room 422

Daniel Zeman
zeman at ufal.mff.cuni.cz
room 409

(Archive of the practical classes from 2020)

Outline

  • various formats for phrase-structure and dependency trees, transformations (Perl or another programming language)
  • mining information from the word layer and morphological layer of the Prague Dependency Treebank or the Prague English Dependency Treebank (bash, Perl or another programming language)
  • mining information from all layers of the PDT/PEDT (btred, Perl)
  • mining information from UD data (btred, Perl)
  • searching in treebanks with PML-Tree Query (PML-TQ)

Homework

Results of the homework (click here)

This year, all homework should be submitted by e-mail to mirovsky at ufal.mff.cuni.cz.
 


Classes 05 and 06 - May 9th and 23rd, 2024

The last two practical classes will also take the form of individual study and homework. The topic is "searching in PDT data using PML-TQ", and there will be one last homework (details soon).

PML-TQ (Prague Markup Language - Tree Query) is a client-server system for querying any treebank encoded in PML, i.e. any treebank that can be opened/edited in TrEd. PML-TQ queries can be processed either in TrEd (there is an extension for that), or on a server. If on a server, we can either use TrEd as a client (again, with the extension), or there is a web browser interface. For simplicity, we will use the web browser client.

Please follow the tutorial for "PML-TQ in the web browser client". The example queries in the tutorial can be run directly on the data of the Prague Dependency Treebank on the Lindat server hosted at ÚFAL by clicking the "try the query" button next to each example. Some links in the tutorial may not work; e.g., the link to the list of available treebanks seems to be broken. Please access the PDT 3.0 server directly by this link.

Homework 05: after studying the tutorial, solve the homework (due on May 28).


Class 04 - April 25th, 2024

Still in the mode of individual study and homework solution, we will continue our work with btred, this time with Universal Dependencies.

TrEd, btred and Universal Dependencies

TrEd can be used for Universal Dependencies (UD) files, provided you have installed the UD extension, "TrEd Extension to Work with Universal Dependencies" (TrEd -> Setup -> Manage Extensions -> Get New Extensions). Then you can simply open a *.conllu file.

Try downloading the data for the class, a single file from one of the English UD corpora,
or similarly sized (in number of words) Czech data, a part of the Prague Dependency Treebank in UD,
and open the data in TrEd.

With btred, the situation is more complex. Unlike TrEd, btred does not see libraries installed along with extensions (this may be a btred bug). Therefore, to process UD files in btred, do the following:

Once in each bash session:

    PERL5LIB+=:~/.tred.d/extensions/ud/libs/

    (or substitute any other path where you have your extensions installed; on some systems you may need to avoid '~' and use the full path instead; on MS Windows, it may be C:\Users\[user_name]\AppData\Roaming\.tred.d),

or (better) add the above-mentioned path to PERL5LIB in ~/.bashrc (or a similar file, depending on your system and setup).

Then, each time you run btred, use the switch -B Treex::PML::Backend::UD, e.g.:

    btred -B Treex::PML::Backend::UD -TNe 'writeln($this->attr("lemma"))' *.conllu

I suggest e.g. setting an alias in .bashrc:

   alias udbtred='btred -B Treex::PML::Backend::UD'
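To sanity-check the output of the btred one-liner above, the lemma column can also be read straight from the CoNLL-U file, without TrEd or btred. Below is a minimal sketch in Python (not part of the TrEd toolchain). It assumes the standard CoNLL-U layout of ten tab-separated columns with the lemma in the third one; the function name read_lemmas and the tiny sample sentence are made up for illustration.

```python
import io


def read_lemmas(lines):
    """Yield the LEMMA column of every regular token line in CoNLL-U input."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line (sentence boundary) or comment line
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line; ignored in this sketch
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        yield cols[2]


# Hypothetical sample data, only to make the sketch runnable on its own.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for lemma in read_lemmas(io.StringIO(SAMPLE)):
        print(lemma)
```

To use it on real data, pass an open file instead of the sample, e.g. read_lemmas(open("file.conllu", encoding="utf-8")).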

Homework 04: After setting up btred for Universal Dependencies (see above), carefully read the instructions for the homework (due on May 7th).


Class 03 - April 11th, 2024

Unfortunately, I am still ill, so the lesson will once again be in the form of individual study and homework solution. The goal today is to learn how, in btred, to cross from the tectogrammatical layer to the analytical layer, and how to use hooks.

Homework 03: Please finish steps 6 and 7 of the btred tutorial, and then carefully read the instructions for the homework (due on April 23rd).


Class 02 - March 28th, 2024

Because of my continuing illness, the lesson will be in the form of individual study and homework solution. The goal of the individual study and the homework is to learn to use btred, a scripting tool for working with TrEd data.

Homework 02: After you finish the btred tutorial (see below), carefully read the instructions for the homework (due on April 9th).

Individual study

Our task is to learn to use btred, a scripting tool for working with TrEd data.

Let me start with a few clarifying statements, which you may already know:

  • Tree editor TrEd works primarily with data encoded in the Prague Markup Language (PML), which is a general XML format for encoding linguistically annotated treebanks. Specifics of individual treebanks (annotation layers, types of nodes, types of relations between the nodes, attributes defined at nodes, etc.) are defined in TrEd extensions.
  • btred is a command line scripting interface for the PML data. It can use both general functions defined in the PML and treebank-specific functions defined in TrEd extensions.
  • TrEd and btred are written in Perl, so btred scripts also need to be written in Perl. However, you only need basic knowledge of Perl.

Your task

Please follow the btred tutorial, which will take you through the first steps of working with btred. Our plan is to cover steps 1-5 and 7 of the tutorial in this class and step 6 in the next class. For the homework, knowledge from steps 1-5 and 7 should suffice.

For the tutorial, you can use any PDT-like data, e.g. these PDT data containing annotation of texts up to the analytical layer (a-layer). For examples that use the tectogrammatical layer (t-layer), use these data from the PDT, which also contain the t-files.

As Perl may be new to you and btred most certainly is, let me give you a few hints to make your work with Perl and btred easier. You will see in the tutorial that btred scripts may start with three different lines:

  • #!btred -e function_to_run() # the function function_to_run() will be run once for each given file. You can get an array of all trees (their roots) in the given file by my @roots = GetTrees().
  • #!btred -T -e function_to_run() # the function function_to_run() will be run on each tree in the given files. Variable $root will contain the root of the current tree. You can get an array of all nodes (incl. the root) in the given tree by my @nodes = GetNodes($root), or (excl. the root) by my @nodes = $root->descendants
  • #!btred -TN -e function_to_run() # the function function_to_run() will be run on all nodes in all trees in the given files. Variable $root will contain the root of the current tree and variable $this will contain the current node.

A simple script for counting all nodes in each given file and printing the number next to each file name might look like this:

#!btred -e count_nodes()

sub count_nodes {
    my $total_count = 0;
    my @roots = GetTrees();  # get an array of roots of all trees in the file
    foreach my $root (@roots) {
      my @nodes = GetNodes($root);  # get an array of all nodes in the tree (incl. the root)
      my $number_of_nodes = scalar(@nodes);  # get the length of the array
      $total_count += $number_of_nodes;
    }
    my $filename = FileName();
    print "$filename: $total_count\n";
}

If the script is named count.btred, it can be run on all gzipped a-files in the current directory from a terminal with the following command:

btred -I count.btred *.a.gz

I suggest that after the first line in any btred (or Perl in general) script, you add the following Perl instructions:

use strict;  # it informs e.g. about non-declared variables (often typos)
use warnings;  # it warns e.g. if an uninitialized variable is used (e.g. in addition or concatenation)
use utf8;  # it allows utf8 in the script source code
binmode STDIN, ':utf8';  # setting utf8 for STDIN
binmode STDOUT, ':utf8';  # ditto for STDOUT
binmode STDERR, ':utf8';  # ditto for STDERR

Manuals and documentation

For writing btred scripts generally and for a particular treebank (say, a PDT-like treebank), there are three main sources of information:

Exercise after 5 steps of the tutorial

After you finish the first five steps of the tutorial, you may practice the acquired knowledge, if you want, on the following tasks (you may not need to actually write the scripts; just thinking the tasks through may suffice):

  • Print the id of each tree root and the depth of the tree (the length of the longest path from the root to a leaf) to STDOUT.
  • Count (and print to STDOUT) the distribution of sizes (number of nodes) of all trees in the given files (i.e., how many times there are trees with 1 node, 2 nodes, 3 nodes, etc.). You may count the final distribution in an outside script (bash) or use the function exit_hook.
  • Print all sentences in the a-files that are shorter than 5 tokens (there is a function PML_A::GetSentenceString($root) defined in the TrEd extension for the PDT).
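
The logic behind the first two tasks can be thought through independently of btred. The following is a minimal sketch in Python on a deliberately simplified tree representation: each tree is a list in which position i holds the index of node i's parent, and the root has None. This representation, the function names and the sample trees are all assumptions made for illustration; in btred you would walk the real nodes instead.

```python
from collections import Counter


def tree_depth(parents):
    """Length (in edges) of the longest path from the root to a leaf.

    parents[i] is the index of the parent of node i; the root has None.
    """
    def depth_of(i):
        steps = 0
        while parents[i] is not None:  # climb until the root is reached
            i = parents[i]
            steps += 1
        return steps

    return max(depth_of(i) for i in range(len(parents)))


def size_distribution(trees):
    """Map tree size (number of nodes) to the number of trees of that size."""
    return Counter(len(parents) for parents in trees)


# Hypothetical sample trees in the simplified representation.
TREES = [
    [None, 0, 1, 1],  # 4 nodes: root -> n1 -> {n2, n3}
    [None, 0],        # 2 nodes: root -> n1
    [None, 0, 0, 2],  # 4 nodes: root -> {n1, n2}, n2 -> n3
]

if __name__ == "__main__":
    for parents in TREES:
        print("depth:", tree_depth(parents))
    for size, count in sorted(size_distribution(TREES).items()):
        print(f"{size} nodes: {count} tree(s)")
```

Climbing from every node to the root is quadratic in the worst case, which is harmless for sentence-sized trees; a recursive walk from the root down would visit each node only once.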

Exercise after 7 steps of the tutorial

  • Count the distribution of morphological tags (use only the first five positions of the tag) for a-nodes with afun 'Pred', and similarly for a-nodes with afun 'Sb'.

As another example, below is a script for printing out verb-less sentences from the analytical layer:

#!btred -T -e function_to_run()
use strict;

sub function_to_run {
  my $found = 0; # remember if a verb has been found
  my @nodes = GetNodes($root); # get all nodes in the tree
  foreach my $node (@nodes) { # and process them one by one
    my $tag = $node->attr('m/tag'); # get the morphological tag
    if ($tag =~ /^V/) { # check if it is a verb
      $found = 1; # if it is, remember that a verb has been found
      last; # and finish the cycle
    }
  }

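  if ($found == 0) { # if a verb was not found in the sentence
    my $id = $root->attr('id'); # get the id of the tree root (it was undeclared before, which fails under use strict)
    my $sent = PML_A::GetSentenceString($root); # get the sentence
    print "$id: $sent\n"; # and print it out
  }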
}

 


Class 01 - March 14th, 2024

Because of illness, the lesson will take the form of individual homework. Please read Homework 01 carefully. Send the results to me by e-mail (mirovsky@ufal.mff.cuni.cz) by March 24th. Also, ask me by e-mail if you encounter difficulties or if something is unclear in the instructions.


Class 00 - February 22nd, 2024

(slides from the class)

The first task for today: Install the tree editor TrEd on the computers in the lab or on your personal computers from the TrEd home page.

On MS Windows, use the installation package containing also the Strawberry Perl distribution.

On Linux, follow these instructions:

  1. Setting up cpan (so that it uses local directories; http://www.perlmonks.org/?node_id=630026):
    • mkdir -p ~/.cpan/CPAN
      echo "1" >~/.cpan/CPAN/MyConfig.pm
    • perl -MCPAN -e shell
      cpan> o conf init # (use local::lib and at the end, allow setting (or set manually) some variables in .bashrc file)
    • exit cpan, then exit and restart bash
  2. Installing cpanm for easier installation of other Perl modules
    • cpan App::cpanminus
  3. Installing TrEd

The second task for today: Test the installation, set up TrEd (extensions, fonts)

After we have installed TrEd, let us try it - download the following data:
lindat.cz - go to "Repository" and search for Prague Dependency Treebank 2.0 - sample data

After you unzip the data, try to open one of the .t.gz files. You should get an error message complaining about missing schemas. It is because you also need to install a TrEd extension for the given type of data:

In TrEd, go to Setup -> Manage Extensions -> Get New Extensions and search for pdt20. Check it and press "Install Selected". Now close and start TrEd again. It should be able to open the data now.

You can customize TrEd in the configuration file .tredrc - customize fonts: font section in TrEd documentation

TrEd can handle all types of treebanks - try, e.g., an example from the Penn Treebank:

An example file from the Penn Treebank transformed to the PML - to test the TrEd installation; an extension for the Penn Treebank files (ptb) needs to be installed first...

English data to test the TrEd installation: section 000 of PEDT

  • Download and unzip the data. It is section 000 of the Penn Treebank transformed to the format of the Prague treebank family; in this particular case, one document is represented by three files corresponding to surface syntax - analytical layer (a-files), deep syntax - tectogrammatical layer (t-files), and original phrase structure layer (p-files). Then try to open one of the t-files in TrEd (you will need to install the extension pedt); you can also open an a-file and a p-file.

The third task for today: Transform phrase-structure trees to dependency ones

Sample phrase structure tree (file)

S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)

Another sample phrase structure tree (file)

S (
  NP ( A ('Young') * N ('men'))
  * VP (* V ( 'love' )
        COORD (NP ( N('beer'))
               * CONJ ('and')
               NP ( N( 'girls' ) )
        ))
)
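
One possible way to approach the transformation is sketched below in Python (the task may equally be solved in Perl). The sketch assumes the bracketed format of the samples above: a label followed by parentheses that contain either one quoted word or a sequence of child nodes, with '*' marking the head child. A phrase takes its lexical head from its head child, and the lexical heads of the other children become its dependents. Two simplifications are assumptions of this sketch, not part of the task description: when no child is starred, the first child is taken as the head (trivially right for the single-child phrases in the samples, only a guess otherwise), and words are identified by their form, which works only while no word is repeated in the sentence. All function names are hypothetical.

```python
import re

# Tokens: parentheses, the head marker '*', quoted words, and node labels.
TOKEN = re.compile(r"\(|\)|\*|'[^']*'|[A-Za-z]+")


def parse(text):
    """Parse the bracketed notation into nested dicts."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        starred = tokens[pos] == "*"
        if starred:
            pos += 1
        label = tokens[pos]
        pos += 2  # skip the label and the opening "("
        if tokens[pos].startswith("'"):
            # terminal node: strip the surrounding quotes from the word
            result = {"label": label, "head": starred, "word": tokens[pos][1:-1]}
            pos += 1
        else:
            children = []
            while tokens[pos] != ")":
                children.append(node())
            result = {"label": label, "head": starred, "children": children}
        pos += 1  # skip the closing ")"
        return result

    return node()


def to_dependencies(tree):
    """Return (root_word, edges), where edges are (governor, dependent) pairs."""
    edges = []

    def lexical_head(n):
        if "word" in n:
            return n["word"]
        kids = n["children"]
        # The starred child is the head; fall back to the first child (assumption).
        head_child = next((k for k in kids if k["head"]), kids[0])
        head_word = None
        dependents = []
        for k in kids:
            word = lexical_head(k)
            if k is head_child:
                head_word = word
            else:
                dependents.append(word)
        edges.extend((head_word, d) for d in dependents)
        return head_word

    return lexical_head(tree), edges


SAMPLE = """
S (
  NP ( N ( 'Peter' ) )
  * VP ( * V ( 'gave' )
         NP ( D   ( 'a' )
              * N ( 'flower' ) )
         PP ( * P ( 'to' )
              N   ( 'Mary' ) )
       )
)
"""

if __name__ == "__main__":
    root, edges = to_dependencies(parse(SAMPLE))
    print("root:", root)
    for governor, dependent in edges:
        print(f"{governor} -> {dependent}")
```

For real data, the terminal nodes should carry their position in the sentence so that repeated words stay distinct.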