If an application is supposed to understand natural language to some extent, it usually has to parse the input utterances syntactically. That is, it attempts to discover the relations between the words of a sentence and the way their meanings combine to form the overall meaning of the sentence. We call the application module responsible for this a parser.
A whole range of syntactic formalisms has been proposed to model the syntax of natural languages. Most parsers rely heavily on treebanks: corpora of written or spoken utterances in which the word-to-word relations have been annotated manually by linguistically trained annotators. As most treebanks are bound to a particular syntactic formalism, the formalism used by a parser is usually determined by the data available for its training.
Two particular formalisms deserve special attention: dependency syntax, as defined in the Prague Dependency Treebank (PDT), and constituent syntax, as defined in the Penn Treebank. The former is the most important formalism for Czech (and the only one for which a Czech treebank is available); the latter applies to English and has historically been more popular in some parts of the world.
PDT has been annotated on three layers, called morphological, analytical and tectogrammatical. The analytical layer corresponds to the surface syntax, the tectogrammatical layer to the deep syntax. Thus the analytical representation (AR) is closer to the appearance of the sentence in the text, while the tectogrammatical representation (TR) is closer to its meaning. PDT 1.0 (released in 2001) contained ARs and just a tiny sample of TRs. TRs first appear in considerable numbers in PDT 2.0 (released in 2005).
The annotation in ARs consists of two parts: the dependency structure (tree), and the analytical functions (also called syntactic tags, s-tags or dependency relation labels). Some parsers concentrate only on the tree structure and do not assign the s-tags. Nevertheless, s-tag assignment is a rather easy task once the structure has been built. The linguistic description of what the ARs of particular language constructions should look like is given in the manual for the annotators (Czech version here). The list and description of possible s-tags is given there as well.
An analytical dependency structure is a rooted tree where each node (except the root) corresponds to one word of the underlying sentence (and each word has a corresponding node). The simplest representation of such a tree is a sequence of integers: the i-th position in the sequence corresponds to the i-th word of the sentence, and the number at that position is interpreted as the index of the word on which the i-th word depends. We use the terms dependent, depending node or child for the i-th word, and governor, governing node or parent for the other word.
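To make the representation concrete, here is a minimal Python sketch of such a parent-index sequence for a short English sentence (the sentence and its analysis are illustrative, not taken from PDT; index 0 stands for the artificial root of the tree):

```python
# Parent-index representation of a dependency tree (illustrative example).
sentence = ["Prague", "is", "a", "beautiful", "city"]
# parents[i] is the 1-based index of the word on which word i+1 depends;
# 0 denotes the artificial root.
parents = [2, 0, 5, 5, 2]

for i, (word, head) in enumerate(zip(sentence, parents), start=1):
    governor = "ROOT" if head == 0 else sentence[head - 1]
    print(f"{i}: {word} -> {governor}")
```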
The standard method of evaluating parser accuracy is to compute the percentage of words, over all words in a test data set, that got the correct parent index. This is also called the unlabeled attachment score (UAS), to emphasize that the labels of the dependency relations are not evaluated. Alternatively, we can require that both the parent is identified and the relation is labeled correctly; then we have the labeled attachment score (LAS). Unless specifically noted otherwise, the term accuracy in this overview means UAS.
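For illustration, the following Python sketch computes UAS and LAS over parallel lists of gold-standard and predicted parent indices and relation labels (the toy data, the s-tags and the function name are invented for this example):

```python
def attachment_scores(gold_heads, pred_heads, gold_labels, pred_labels):
    """Return (UAS, LAS) in percent for one or more concatenated sentences."""
    assert len(gold_heads) == len(pred_heads) == len(gold_labels) == len(pred_labels)
    total = len(gold_heads)
    # UAS: the predicted parent index matches the gold one.
    uas_hits = sum(g == p for g, p in zip(gold_heads, pred_heads))
    # LAS: both the parent index and the relation label match.
    las_hits = sum(g == p and gl == pl
                   for g, p, gl, pl in zip(gold_heads, pred_heads,
                                           gold_labels, pred_labels))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

# Toy example: 4 tokens, 3 correct heads, 2 of those also correctly labeled.
uas, las = attachment_scores([2, 0, 2, 3], [2, 0, 2, 2],
                             ["Sb", "Pred", "Obj", "Atr"],
                             ["Sb", "Pred", "Atr", "Atr"])
print(f"UAS = {uas:.1f} %, LAS = {las:.1f} %")
```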
PDT 1.0 provides two data sets intended for evaluating analytical parsers, the d-test (development) and the e-test (cross-evaluation). See also the PDT 1.0 Data Layout Table. The d-test consists of 153 files, 7,319 non-empty sentences and 126,030 words. Evaluation on the d-test data is available for most parsers, so for the sake of comparability we stick with that data here.
The following table gives the accuracy figures of various parsers on the PDT 1.0 d-test data. (Note: some of the parsers are still under development. We try to maintain here either their published results, or the results we measured ourselves where the parser or its output on the d-test data is available to us.)
Author (parser) | Accuracy | Notes |
---|---|---|
Combination ec+mc+zž+dz | 86.3 | Zeman & Žabokrtský (2005) |
Hall/Novák/Charniak | 85.0 | Hall & Novák (2005) |
Ryan McDonald | 84.4 | McDonald et al. (2005) |
Eugene Charniak | 84.3 | Charniak (2000) describes the original parser for English. Czech results measured by Zeman on the output provided by Charniak in 2003. |
Michael Collins | 82.5 | Collins et al. (1999) gives results on PDT 0.5. Re-run and re-measured on PDT 1.0 by Zeman. |
Joakim Nivre | 80.1 | Nivre & Nilsson (2005) |
Zdeněk Žabokrtský | 75.2 | Parser run and accuracy measured by Zeman in 2004. |
Daniel Zeman | 74.7 | Zeman (2004a) |
Václav Klimeš | 74.7 | Accuracy reported by Klimeš in 2006; to be published. |
Tomáš Holan (r2l) | 71.7 | Measured by Zeman on parser output provided by Holan in early 2004. |
Tomáš Holan (l2r) | 69.9 | Measured by Zeman on parser output provided by Holan in early 2004. |
Tomáš Holan (pshrt) | 62.8 | Measured by Zeman on parser output provided by Holan in early 2004. |
Note that due to a version incompatibility, Charniak's parser cannot be re-trained. Collins' parser was included on the PDT 2.0 CD-ROM.
PDT 2.0 provides two data sets intended for evaluating analytical parsers, the d-test (development) and the e-test (cross-evaluation). Each of those sets is split into two parts, one that has tectogrammatical annotation as well (tamw/[de]test/*.a) and one that does not (amw/[de]test/*.a). For analytical parsing, both parts have to be combined. See also the PDT 2.0 Data Description. The training data consists of 68,562 sentences and 1,172,299 tokens. The d-test data consists of 9,270 sentences and 158,962 tokens. The e-test data consists of 10,148 sentences and 173,586 tokens. Do not use this data to test parsers that have been trained on PDT 1.0: some of the current test data were declared training data in PDT 1.0!
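As an illustration of combining the two parts, the following Python sketch collects all analytical d-test files from both subtrees, assuming the directory layout indicated above; the root path is a placeholder that has to be adjusted to the actual location of the data:

```python
# A minimal sketch of collecting all analytical files of the PDT 2.0 d-test,
# assuming the tamw/ and amw/ layout described above.
import glob
import os

PDT_ROOT = "/path/to/pdt20/data"  # hypothetical root directory; adjust as needed
dtest_files = sorted(
    glob.glob(os.path.join(PDT_ROOT, "tamw", "dtest", "*.a")) +
    glob.glob(os.path.join(PDT_ROOT, "amw", "dtest", "*.a"))
)
print(f"{len(dtest_files)} analytical files in the d-test set")
```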
Extra care must be taken when running parsing experiments or reporting results on PDT 2.0 as to which source of morphological information was used by the parser: undisambiguated, automatically disambiguated (by which tagger?) or manually disambiguated. In fact, the same had to be taken into account when working with PDT 1.0. However, with version 2.0 it is easier to overlook that one is actually working with the wrong source of morphology, because:
It is strongly recommended to report results of experiments in which the parser did not have access to any human annotation of the test data, including morphology (of course, use everything you find useful in the training data). The obvious reason is that a parser is unlikely to have such information available in a real-world application.
The following table gives the accuracy figures of various parsers on the PDT 2.0 test data. (Note: some of the parsers are still under development. We try to maintain here either their published results, or the results we measured ourselves where the parser or its output on the test data is available to us.)
Author (parser) | D-test accuracy | E-test accuracy | Notes |
---|---|---|---|
Combination rmd+mc+zž+5×th* | 86.2 | 85.8 | Holan & Žabokrtský (2006), Simply Weighted Parsers (SWP) |
Hall/Nilsson/Nivre | 86.0 | 85.8 | Malt Parser 1.7 with stacklazy algorithm and Java implementation of LibSVM learner (see Nivre (2009)), run by Zeman in June 2013, using feature definition file provided by the Uppsala team. Automatically disambiguated tags used during both training and parsing. |
McDonald/Novák/Žabokrtský | 84.7 | | Feature engineering over McDonald's MST parser. See Novák & Žabokrtský (2007). |
Ryan McDonald | 84.2 | 84.0 | Same parser as in McDonald et al. (2005), run by Václav Novák in 2006. PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing. |
Michael Collins | 81.6 | 80.9 | Same parser as in Collins et al. (1999), PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing. |
Zdeněk Žabokrtský | 76.1 | 75.9 | A rule-based parser, described in Holan & Žabokrtský (2006). Automatically disambiguated tags used. |
Daniel Zeman | 75.0 | 74.8 | Same parser and settings as in Zeman (2004a), run by Zeman in 2006. Automatically disambiguated tags used during both training and parsing. |
Václav Klimeš | 74.8 | 74.6 | Accuracy reported by Klimeš in 2006; to be published. Automatically disambiguated tags used during both training and parsing. |
Tomáš Holan (r2l) | 74.0 | 73.9 | Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used. |
Tomáš Holan (l2r) | 71.4 | 71.3 | Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used. |
Tomáš Holan (analog) | 71.5 | 71.1 | A parser that “searches for the local tree configuration most similar to the training data” (Holan & Žabokrtský, 2006) (after all, which parser does not?). The parser itself is described in Holan (2005). Automatically disambiguated tags used. |
Tomáš Holan (r23) | 61.1 | 61.7 | Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags. |
Tomáš Holan (l23) | 54.9 | 53.3 | Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags. |
In their EMNLP paper, Koo et al. (2010) report an unlabeled accuracy of 87.32 % on PDT. Unfortunately, they specify neither the PDT version nor the test set they used, not to mention the distinction between gold and automatically disambiguated morphology, so it is difficult to tell how this result compares to the others.
The CoNLL-X (2006) shared task involved dependency parsing of 13 languages including Czech. Training and test data were taken from PDT 1.0. However, the published results are not directly comparable to the results presented above, for the following reasons:
For an overview of the results by the various teams, see Buchholz & Marsi (2006).
Authors | Labeled accuracy | Notes |
---|---|---|
Joakim Nivre | 82.4 | Run later on the CoNLL-X data, see Nivre (2009). |
Ryan McDonald, Kevin Lerman, Fernando Pereira | 80.2 | |
Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, Svetoslav Marinov | 78.4 | |
John O'Neil | 76.6 | |
Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto | 76.2 | |
Kenji Sagae | 75.2 | |
Simon Corston-Oliver, Anthony Aue | 74.5 | |
Ming-Wei Chang, Quang Do, Dan Roth | 72.9 | |
Richard Johansson, Pierre Nugues | 71.5 | |
Xavier Carreras, Mihai Surdeanu, Lluís Màrquez | 68.8 | |
Sebastian Riedel, Ruket Çakıcı, Ivan Meza-Ruiz | 67.4 | |
Eckhard Bick | 63.0 | |
Sander Canisius, Toine Bogers, Antal van den Bosch, Jeroen Geertzen, Erik Tjong Kim Sang | 60.9 | |
Markus Dreyer, David A. Smith, Noah A. Smith | 60.5 | |
Giuseppe Attardi | 59.8 | |
Yu-Chieh Wu, Yue-Shi Lee, Jie-Chi Yang | 59.4 | |
Ting Liu, Jinshan Ma, Huijia Zhu, Sheng Li | 58.5 | |
Michael Schiehlen, Kristina Spranger | 53.3 | |
Deniz Yuret | 51.9 |
The CoNLL 2007 shared task involved dependency parsing of 10 languages including Czech. Training and test data were taken from PDT 2.0.
For an overview of the results by the various teams, see Nivre et al. (2007).
Authors | Labeled accuracy | Unlabeled accuracy |
---|---|---|
Tetsuji Nakagawa | 80.19 | 86.28 |
Xavier Carreras | 78.60 | 85.16 |
Jens Nilsson, Johan Hall, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers | 77.98 | 83.59 |
Ivan Titov, James Henderson | 77.94 | 84.19 |
Giuseppe Attardi, Felice Dell'Orletta, Maria Simi, Atanas Chanev, Massimiliano Ciaramita | 77.37 | 83.40 |
Johan Hall, Jens Nilsson, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers | 77.22 | 82.35 |
Xiangyu Duan, Jun Zhao, Bo Xu | 75.34 | 80.82 |
Kenji Sagae, Jun'ichi Tsujii | 74.83 | 81.27 |
Michael Schiehlen, Kristina Spranger | 73.86 | 81.73 |
Wenliang Chen, Yujie Chang, Hitoshi Isahara | 73.69 | 80.14 |
Le-Minh Nguyen, Akira Shimazu, Phuong-Thai Nguyen, Xuan-Hieu Phan | 72.54 | 80.91 |
Keith Hall, Jiří Havelka, David A. Smith | 72.27 | 78.47 |
Richard Johansson, Pierre Nugues | 70.98 | 77.39 |
Prashanth Reddy Mannem | 70.68 | 77.20 |
Maes | 67.38 | 74.03 |
Yu-Chieh Wu, Jie-Chi Yang, Yue-Shi Lee | 66.72 | 73.07 |
Sander Canisius, Erik Tjong Kim Sang | 56.14 | 72.12 |
Jia | 54.95 | 70.41 |
Svetoslav Marinov | 53.47 | 59.57 |
Daniel Zeman | 50.21 | 59.19 |
The CoNLL 2009 shared task focused on semantic role labeling but it also involved dependency parsing of 7 languages including Czech. Training and test data were taken from PDT 2.0.
For an overview of the results by the various teams, see Hajič et al. (2009) and also this site.
Authors | Labeled accuracy | Unlabeled accuracy |
---|---|---|
Andrea Gesmundo, James Henderson, Paola Merlo, Ivan Titov | 80.38 | |
Bernd Bohnet | 80.11 | |
Wanxiang Che, Zhenghua Li, Yongqiang Li, Yuhang Guo, Bing Qin, Ting Liu | 80.01 | |
Hai Zhao, Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, Kentaro Torisawa | 79.70 | |
Yotaro Watanabe, Masayuki Asahara, Yuji Matsumoto | 78.17 | |
Yi Zhang, Rui Wang, Stephan Oepen | 75.58 | |
Xavier Lluís, Stefan Bott, Lluís Màrquez | 75.00 | |
Brown | 73.29 | |
Buzhou Tang, Lu Li, Xinxin Li, Xuan Wang, Xiaolong Wang | 72.60 | |
Qifeng Dai, Enhong Chen, Liu Shi | 58.69 | |
Han Ren, Donhong Ji, Jing Wan, Mingyao Zhang | 57.30 | |
Daniel Zeman | 57.06 | |
Roser Morante, Vincent van Asch, Antal van den Bosch | 49.41 |
The following list of publications gives a picture of the parsing results achieved within the ÚFAL research projects, as well as some relevant references to publications by authors at other sites.