MorphoDiTa User's Manual
In a natural language text, the task of morphological analysis is to assign to each token (word) in a sentence its lemma (canonical form) and a part-of-speech tag (POS tag). This is usually achieved in two steps: a morphological dictionary looks up all possible lemmas and POS tags for each word, and subsequently a morphological tagger picks the best lemma-POS tag candidate for each word. The second step is called disambiguation.
MorphoDiTa also performs these two steps of morphological analysis: it first outputs all possible pairs of lemma and POS tag for each token. Subsequently, the optimal combination of lemmas and POS tags is selected for the words in a sentence using an algorithm described in Spoustová et al. 2009.
Like any supervised machine learning tool, MorphoDiTa needs a trained linguistic model. This section describes the available language models as well as the command-line tools and interfaces. The C++ library is described elsewhere, either in the MorphoDiTa API Tutorial or in the MorphoDiTa API Reference.
1. Czech MorfFlex2+PDT-C Models
Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology uses the MorfFlex CZ 2.0 Czech morphological dictionary and the Czech tagger is trained on PDT-C 1.0. The morphological derivator uses DeriNet 2.1. Czech models work in MorphoDiTa version 1.9 or later.
Apart from MorfFlex CZ dictionary, a prefix guesser and statistical guesser are implemented and can be optionally used when performing morphological analysis.
1.1. Download
The latest version 220710 of the Czech MorfFlex+PDT models can be downloaded from LINDAT/CLARIN repository.
1.2. Acknowledgements
This work has been supported by the LINDAT/CLARIAH-CZ project funded by the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).
1.2.1. Publications
- (Straková et al., 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
- (Jonáš Vidra et al., 2019) Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, Lukáš Kyjánek. Towards an All-in-One Word-Formation Resource. In Proceedings of the Second Workshop on Resources and Tools for Derivational Morphology (DeriMo 2019). Prague, 2019, pp. 81-89.
- (Jan Hajič et al., 2020) Jan Hajič, Eduard Bejček, Jaroslava Hlavacova, Marie Mikulová, Milan Straka, Jan Štěpánek, and Barbora Štěpánková. Prague Dependency Treebank - Consolidated 1.0. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5208–5218, Marseille, France. European Language Resources Association.
- (Marie Mikulová et al., 2022) Mikulová Marie, Hajič Jan, Hana Jiří, Hanová Hana, Hlaváčová Jaroslava, Jeřábek Emil, Štěpánková Barbora, Vidová Hladká Barbora, Zeman Daniel. Manual for Morphological Annotation, Revision for the Prague Dependency Treebank - Consolidated 2020 release. Technical report no. TR-2020-64, Institute of Formal and Applied Linguistics, Charles University, Prague, Czechia, 2020.
1.3. MorfFlex CZ 2.0 Morphological System
The MorfFlex CZ 2.0 uses the so-called PDT-C tag set, which is an evolution of the original PDT tag set devised by Jan Hajič (Hajič 2004). The tags are positional with 15 positions corresponding to part of speech, detailed part of speech, gender, number, case, etc. (e.g. NNFS1-----A----).
Different meanings of the same lemma are distinguished and additional comments can be provided for every lemma meaning. The lemma itself without the comments and meaning specification is called a raw lemma. The following examples illustrate this:
- Japonsko_;G (raw lemma: Japonsko)
- se_^(zvr._zájmeno/částice) (raw lemma: se)
- tvořit_:T (raw lemma: tvořit)
The complete reference can be found in the Manual for Morphological Annotation, Revision for the Prague Dependency Treebank - Consolidated 2020 release.
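As a rough illustration of the positional tag layout, the following Python sketch splits a 15-character tag into named positions. Only the first five position names come from the description above; the function name `split_tag` and the placeholder names `position_6` to `position_15` are hypothetical and stand in for positions this section does not name.

```python
# Names of the first five positions as listed in the text. The remaining
# ten positions get placeholder names (an assumption of this sketch).
KNOWN_POSITIONS = ["part_of_speech", "detailed_part_of_speech",
                   "gender", "number", "case"]
TAG_LENGTH = 15


def split_tag(tag):
    """Return a dict mapping position names to the characters of a tag."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    placeholders = [f"position_{i}"
                    for i in range(len(KNOWN_POSITIONS) + 1, TAG_LENGTH + 1)]
    return dict(zip(KNOWN_POSITIONS + placeholders, tag))


fields = split_tag("NNFS1-----A----")
print(fields["part_of_speech"], fields["gender"], fields["case"])
```

For the meaning of the individual characters at each position, the annotation manual cited above remains the authoritative source.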
1.4. PDT-C 1.0 Train/Dev/Test Split
The PDT-C corpus consists of four datasets, but some of them do not have an official train/dev/test split. We therefore used the following split:
- PDT dataset is already split into train, dev (dtest), and test (etest).
- PCEDT dataset is a translated version of the Wall Street Journal, so we used the usual split into train (sections 0-18), dev (sections 19-21), and test (sections 22-24).
- PDTSC and FAUST datasets have no split, so we split them into dev (documents with identifiers ending with 6), test (documents with identifiers ending with 7), and train (all the remaining documents).
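The split rules above can be sketched in Python as follows. This is only an illustration of the stated rules; the function names and the example document identifiers are hypothetical, and the actual scripts used to build the models may differ.

```python
def pcedt_split(section):
    """Map a Wall Street Journal section number to train/dev/test."""
    if 0 <= section <= 18:
        return "train"
    if 19 <= section <= 21:
        return "dev"
    if 22 <= section <= 24:
        return "test"
    raise ValueError(f"unexpected section number: {section}")


def suffix_split(document_id):
    """Split used for PDTSC and FAUST: decided by the last character."""
    if document_id.endswith("6"):
        return "dev"
    if document_id.endswith("7"):
        return "test"
    return "train"


# The identifiers below are made up for the example.
for doc in ["doc_0016", "doc_0017", "doc_0018"]:
    print(doc, suffix_split(doc))
```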
1.5. Model Variants
Apart from the primary model, which predicts all 15 tag positions and processes texts with diacritics, we also provide several variants:
- pos_only: Instead of all 15 tag positions, the model predicts only the first 2, which contain the coarse and detailed POS, plus the full lemma, while being circa 15 times faster than the primary model.
- no_dia, no_dia-pos_only: The forms (during morphological analysis, generation, and tagging) have the diacritical marks stripped; however, the lemmas do include them. Useful for processing texts without diacritics.
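To make "diacritical marks stripped" concrete, here is a small Python sketch that removes combining marks via Unicode decomposition. It is an approximation for illustration only; the exact character mapping used when building the no_dia models is not specified here, and `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


print(strip_diacritics("Děti pojedou k babičce."))  # Deti pojedou k babicce.
```

Text in this form is the kind of input the no_dia variants are intended for; the lemmas they return still carry diacritics.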
1.6. Model Performance
Model | Tags PDT | Tags PCEDT | Tags PDTSC | Tags Faust | Tags Macro Avg | Lemmas PDT | Lemmas PCEDT | Lemmas PDTSC | Lemmas Faust | Lemmas Macro Avg | Speed | Size
---|---|---|---|---|---|---|---|---|---|---|---|---
czech-morfflex2.0-pdtc1.0-220710 | 96.29 | 97.00 | 96.90 | 94.87 | 96.27 | 98.69 | 98.85 | 98.18 | 97.53 | 98.31 | 19k toks/s | 24.4MB
czech-morfflex2.0-pdtc1.0-220710-pos_only | 98.99 | 99.12 | 98.45 | 97.85 | 98.60 | 98.50 | 98.63 | 98.09 | 97.05 | 98.07 | 253k toks/s | 9.5MB
czech-morfflex2.0-pdtc1.0-220710-no_dia | 95.57 | 96.13 | 96.40 | 93.46 | 95.39 | 97.88 | 98.20 | 97.67 | 96.57 | 97.58 | 11k toks/s | 30.4MB
czech-morfflex2.0-pdtc1.0-220710-no_dia-pos_only | 98.55 | 98.73 | 98.07 | 97.31 | 98.17 | 97.60 | 97.85 | 97.52 | 95.98 | 97.24 | 177k toks/s | 14.5MB
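The Macro Avg columns appear to be the unweighted mean of the four per-dataset accuracies. This is an inference from the numbers in the table, not something the text states. A short Python check under that assumption, using the tag accuracies of the primary model:

```python
def macro_average(scores):
    """Unweighted mean of per-dataset scores."""
    return sum(scores) / len(scores)


# Tag accuracies of the primary model on PDT, PCEDT, PDTSC and Faust.
primary_tags = [96.29, 97.00, 96.90, 94.87]
reported = 96.27

mean = macro_average(primary_tags)
# The reported value is rounded to two decimals, so compare with a tolerance.
print(abs(mean - reported) < 0.01)
```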
2. Czech MorfFlex+PDT Models
Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology uses the MorfFlex CZ 161115 Czech morphological dictionary and the Czech tagger is trained on PDT 3.0. The morphological derivator uses DeriNet 1.2. Czech models work in MorphoDiTa version 1.9 or later.
Apart from MorfFlex CZ dictionary, a prefix guesser and statistical guesser are implemented and can be optionally used when performing morphological analysis.
Czech models are versioned according to the version of the MorfFlex CZ morphological dictionary used. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 161115.
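As a small illustration of the YYMMDD version format, the following Python sketch converts a version string into a date. The helper name `parse_model_version` is hypothetical, and the sketch relies on Python's convention of mapping two-digit years 00-68 to 2000-2068.

```python
from datetime import datetime


def parse_model_version(version):
    """Interpret a YYMMDD model version string as a calendar date."""
    return datetime.strptime(version, "%y%m%d").date()


print(parse_model_version("161115"))  # 2016-11-15
```

Because the format is fixed-width with the year first, versions can also be ordered by plain string comparison within the same century.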
Compared to Featurama http://sourceforge.net/projects/featurama/ (state-of-the-art Czech tagger implementation), the models are 5 times faster and 10 times smaller.
2.1. Download
The latest version 161115 of the Czech MorfFlex+PDT models can be downloaded from LINDAT/CLARIN repository.
2.1.1. Previous Versions
- Version 160310 of the Czech MorphoDiTa models can be downloaded from LINDAT/CLARIN repository.
- Version 131112 of the Czech MorphoDiTa models can be downloaded from LINDAT/CLARIN repository.
2.2. Acknowledgements
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphological system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphological guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 3.0, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
The morphological derivator is based on DeriNet, which was supported by the Grant No. 16-18177S of the Grant Agency of the Czech Republic and uses language resources developed, stored, and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).
2.2.1. Publications
- (Hajič 2004) Jan Hajič. Disambiguation of Rich Inflection: Computational Morphology of Czech. Karolinum Press (2004).
- Hlaváčová Jaroslava, Kolovratník David. Morfologie češtiny znovu a lépe. In Informačné Technológie - Aplikácie a Teória. Zborník príspevkov, ITAT 2008. Seňa, Slovakia: PONT s.r.o., 2008, pp. 43-47.
- (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.
- (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
- (Žabokrtský et al. 2016) Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra and Adéla Limburská. Merging Data Resources for Inflectional and Derivational Morphology in Czech. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016.
2.3. Czech Morphological System
In the Czech language, MorphoDiTa uses the Czech morphological system by Jan Hajič (Hajič 2004). In this system, which we call the PDT tag set, the tags are positional with 15 positions corresponding to part of speech, detailed part of speech, gender, number, case, etc. (e.g. NNFS1-----A----). Different meanings of the same lemma are distinguished and additional comments can be provided for every lemma meaning. The lemma itself without the comments and meaning specification is called a raw lemma. The following examples illustrate this:
- Japonsko_;G (raw lemma: Japonsko)
- se_^(zvr._zájmeno/částice) (raw lemma: se)
- tvořit_:T (raw lemma: tvořit)
For a more detailed reference about the Czech morphology, please see Lemma and Tag Structure in PDT 2.0.
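The relation between a full lemma and its raw lemma can be approximated with a short Python heuristic. This is a simplified sketch inferred from the examples in this manual (a raw lemma ends before the first `_`, or before a `-` followed by a digit, as in `jet-1`); it is not the library's actual implementation, and `raw_lemma` is a hypothetical helper name.

```python
def raw_lemma(lemma):
    """Heuristically cut a lemma before its meaning number or comments."""
    # Start at index 1 so that a lemma consisting of a single
    # punctuation character (such as "_" or "-") is kept intact.
    for i in range(1, len(lemma)):
        if lemma[i] == "_":
            return lemma[:i]
        if lemma[i] == "-" and lemma[i + 1:i + 2] in tuple("0123456789"):
            return lemma[:i]
    return lemma


for lemma in ["Japonsko_;G", "se_^(zvr._zájmeno/částice)", "tvořit_:T"]:
    print(lemma, "->", raw_lemma(lemma))
```

For reliable results on real data, the strip_lemma_id tag set converter described in this manual is the supported route.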
2.4. Main Czech Model
The main Czech model contains the following files:
czech-morfflex-161115.dict
- Morphological dictionary based on Jan Hajič's (Hajič 2004) system with the PDT tag set, created from the MorfFlex CZ 161115 morphological dictionary and DeriNet 1.2.
czech-morfflex-pdt-161115.tagger
- Tagger trained on the training portion of PDT 3.0 using the neopren feature set. It contains the czech-morfflex-161115.dict morphological dictionary and reaches 95.55% tag accuracy, 97.86% lemma accuracy and 95.06% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-161115.dict dictionary). Model speed: ~15k words/s, model size: 18MB.
2.5. Part of Speech Only Variant
The PDT tag set used by the main Czech model is very fine-grained. In many situations, only the part of speech tags would be sufficient. Therefore, we provide a variant of the model, denoted as pos_only, where only the first two characters of the fifteen-letter tags are used, representing the part of speech and detailed part of speech, respectively. There are 67 such two-letter tags.
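The relation between full tags and pos_only tags can be shown in a few lines of Python. The tags below are taken from the tagger output example in this manual; `to_pos_only` is a hypothetical helper, and the sketch only mirrors the truncation described above, not how the pos_only models are trained.

```python
def to_pos_only(tag):
    """Keep the part of speech and detailed part of speech characters."""
    return tag[:2]


full_tags = [
    "NNFP1-----A----",
    "VB-P---3F-AA---",
    "RR--3----------",
    "NNFS3-----A----",
    "Z:-------------",
]
print(sorted({to_pos_only(tag) for tag in full_tags}))  # ['NN', 'RR', 'VB', 'Z:']
```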
czech-morfflex-161115-pos_only.dict
- Morphological dictionary based on Jan Hajič's (Hajič 2004) system, created from the MorfFlex CZ 161115 morphological dictionary and DeriNet 1.2. Only the first two tag characters of the PDT tag set are used.
czech-morfflex-pdt-161115-pos_only.tagger
- Very fast tagger trained on the training portion of PDT 3.0 using the neopren feature set. It contains the czech-morfflex-161115-pos_only.dict morphological dictionary and reaches 99.01% tag accuracy, 97.69% lemma accuracy and 97.66% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-161115-pos_only.dict dictionary). Model speed: ~250k words/s, model size: 5MB.
2.6. No Diacritical Marks Variant
Sometimes the text to be analyzed does not contain diacritical marks. We therefore provide variants of the morphological dictionary and tagger for this purpose – morphological analysis, morphological generation and tagging employ forms without diacritical marks. Note that the lemmas do have diacritical marks.
We provide the no_dia variants for all four models described above:
czech-morfflex-161115-no_dia.dict
- No diacritical marks variant of czech-morfflex-161115.dict.
czech-morfflex-pdt-161115-no_dia.tagger
- No diacritical marks variant of czech-morfflex-pdt-161115.tagger. It reaches 94.69% tag accuracy, 97.06% lemma accuracy and 93.84% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-161115-no_dia.dict dictionary) with diacritical marks removed. Model speed: ~7.5k words/s, model size: 22MB.
czech-morfflex-161115-no_dia-pos_only.dict
- No diacritical marks variant of czech-morfflex-161115-pos_only.dict.
czech-morfflex-pdt-161115-no_dia-pos_only.tagger
- No diacritical marks variant of czech-morfflex-pdt-161115-pos_only.tagger. It reaches 98.55% tag accuracy, 97.07% lemma accuracy and 97.02% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-161115-no_dia-pos_only.dict dictionary) with diacritical marks removed. Model speed: ~125k words/s, model size: 11MB.
2.7. Models with Raw Lemmas
The Czech morphological system distinguishes different meanings of the same lemma by numbering the lemmas with multiple meanings and supplying additional comments for every lemma meaning, as described and demonstrated in Czech Morphological System. Sometimes this may be undesirable, for example when comparing to systems which do not use the MorfFlex CZ morphological dictionary.
To obtain lemmas without any additional information (raw lemmas in terms of the MorphoDiTa API), use the strip_lemma_id tag set converter. Previously, specific dictionary and tagger model variants were provided for this purpose; they are no longer needed.
2.8. Czech Model History
czech-morfflex-161115 and czech-morfflex-pdt-161115 (require MorphoDiTa 1.9 or later)
- Trained on PDT 3.0 using MorfFlex CZ 161115 and DeriNet 1.2, variants: Part of Speech Only, No Diacritical Marks. Download from LINDAT/CLARIN repository.
czech-morfflex-160310 and czech-morfflex-pdt-160310 (require MorphoDiTa 1.0 or later)
- Trained on PDT 3.0 using MorfFlex CZ 160310, variants: Part of Speech Only, No Diacritical Marks. Download from LINDAT/CLARIN repository.
czech-morfflex-131112 and czech-morfflex-pdt-131112 (require MorphoDiTa 1.0 or later)
- Trained on PDT 2.5 using MorfFlex CZ 131112, variants: Part of Speech Only, Raw Lemmas. Download from LINDAT/CLARIN repository.
3. Slovak MorfFlex+PDT Models
Slovak models are distributed under the CC BY-NC-SA licence. The Slovak morphology uses the MorfFlex SK 170914 Slovak morphological dictionary and the Slovak tagger is trained on automatically translated PDT 3.0. Slovak models work in MorphoDiTa version 1.9 or later.
Apart from MorfFlex SK dictionary, a statistical guesser is implemented and can be optionally used when performing morphological analysis.
Slovak models are versioned according to the version of the MorfFlex SK morphological dictionary used. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 170914.
3.1. Download
The latest version 170914 of the Slovak MorfFlex+PDT models can be downloaded from LINDAT/CLARIN repository.
3.2. Acknowledgements
This work has also been supported by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2018101). It has also been using language resources developed and stored and distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2015071).
The Czech morphological system was devised by Jan Hajič.
The MorfFlex SK dictionary was created by Jan Hajič and Jan Hric.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 3.0, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
3.2.1. Publications
- (Hajič 2004) Jan Hajič. Disambiguation of Rich Inflection: Computational Morphology of Czech. Karolinum Press (2004).
- (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.
- (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
3.3. Slovak Morphological System
In the Slovak language, MorphoDiTa uses the same morphological system as Czech.
3.4. Main Slovak Model
The main Slovak model contains the following files:
slovak-morfflex-170914.dict
- Morphological dictionary based on Jan Hajič's (Hajič 2004) system with the PDT tag set, created from the MorfFlex SK 170914 morphological dictionary.
slovak-morfflex-pdt-170914.tagger
- Tagger trained on the training portion of automatically translated PDT 3.0 using the neopren feature set. It contains the slovak-morfflex-170914.dict morphological dictionary and reaches 92.8% tag accuracy, 96.3% lemma accuracy and 92.0% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the slovak-morfflex-170914.dict dictionary). Model speed: ~5k words/s, model size: 17MB.
3.5. Part of Speech Only Variant
The PDT tag set used by the main Slovak model is very fine-grained. In many situations, only the part of speech tags would be sufficient. Therefore, we provide a variant of the model, denoted as pos_only, where only the first two characters of the fifteen-letter tags are used, representing the part of speech and detailed part of speech, respectively. There are 67 such two-letter tags.
slovak-morfflex-170914-pos_only.dict
- A variant of slovak-morfflex-170914.dict, where only the first two tag characters are used.
slovak-morfflex-pdt-170914-pos_only.tagger
- Very fast variant of slovak-morfflex-pdt-170914.tagger predicting only two-character tags. It reaches 98.3% tag accuracy, 97.4% lemma accuracy and 96.8% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the slovak-morfflex-170914-pos_only.dict dictionary). Model speed: ~200k words/s, model size: 4MB.
3.6. No Diacritical Marks Variant
Sometimes the text to be analyzed does not contain diacritical marks. We therefore provide variants of the morphological dictionary and tagger for this purpose – morphological analysis, morphological generation and tagging employ forms without diacritical marks. Note that the lemmas do have diacritical marks.
We provide the no_dia variants for all four models described above:
slovak-morfflex-170914-no_dia.dict
- No diacritical marks variant of slovak-morfflex-170914.dict.
slovak-morfflex-pdt-170914-no_dia.tagger
- No diacritical marks variant of slovak-morfflex-pdt-170914.tagger. It reaches 91.4% tag accuracy, 92.8% lemma accuracy and 89.0% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the slovak-morfflex-170914-no_dia.dict dictionary) with diacritical marks removed. Model speed: ~5k words/s, model size: 18MB.
slovak-morfflex-170914-no_dia-pos_only.dict
- No diacritical marks variant of slovak-morfflex-170914-pos_only.dict.
slovak-morfflex-pdt-170914-no_dia-pos_only.tagger
- No diacritical marks variant of slovak-morfflex-pdt-170914-pos_only.tagger. It reaches 97.5% tag accuracy, 93.9% lemma accuracy and 93.2% overall accuracy on PDT 3.0 etest data (whose morphological tags and lemmas were remapped using the slovak-morfflex-170914-no_dia-pos_only.dict dictionary) with diacritical marks removed. Model speed: ~200k words/s, model size: 7MB.
4. English Morphium+WSJ Models
English models are created using the following data:
- SCOWL (Spell Checker Oriented Word Lists): This word list is used in morphological generation to create all possible word forms of a given word. Copyright: Copyright 2000-2011 by Kevin Atkinson. Permission to use, copy, modify, distribute and sell these word lists, the associated scripts, the output created from the scripts, and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kevin Atkinson makes no representations about the suitability of this array for any purpose. It is provided "as is" without express or implied warranty.
- Wall Street Journal, part of the Penn Treebank 3: Morphologically annotated texts which are commonly used to train English POS taggers. Licensing: Available as LDC99T42 in the LDC catalog under the LDC User Agreement.
The resulting models are distributed under the CC BY-NC-SA licence. English models work in MorphoDiTa version 1.1 or later.
English models are versioned according to the release date. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 140407.
4.1. Download
The latest version 140407 of the English Morphium+WSJ models can be downloaded from LINDAT/CLARIN repository.
4.2. Acknowledgements
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The morphological POS analyzer development was supported by a grant of the Ministry of Education, Youth and Sports of the Czech Republic No. LC536 "Center for Computational Linguistics". The morphological POS analyzer research was performed by Johanka Spoustová (Spoustová 2008; the Treex::Tool::EnglishMorpho::Analysis Perl module). The lemmatizer was implemented by Martin Popel (Popel 2009; the Treex::Tool::EnglishMorpho::Lemmatizer Perl module). The lemmatizer is based on morpha, which was released under the LGPL licence as a part of the RASP system.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
4.2.1. Publications
- (Popel 2009) Martin Popel. Ways to Improve the Quality of English-Czech Machine Translation. Master Thesis at Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague (2009).
- (Spoustová 2008) Drahomíra "johanka" Spoustová. Morphium – morphological analyser for Penn treebank POS tagset. Perl Software developed at Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague (2008).
- (Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.
- (Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
4.3. English Morphological System
The English morphology uses standard Penn Treebank POS tags. Nevertheless, the lemma structure is unique:
- The lemmatizer recognizes negative prefixes and removes them from the lemma. In terms of the MorphoDiTa API, the raw lemma is the lemma without the negative prefix.
- The negative prefix is also stored to allow morphological generation of word form with the same negative prefix. In terms of MorphoDiTa API, lemma id is the raw lemma plus the negative prefix.
The negative prefix is separated from the (always nonempty) lemma using a ^ character (able^un). During morphological generation, the negative prefix is honored. Furthermore, when the lemma ends with ^ (i.e., the negative prefix is empty, as in able^), forms with negative prefixes are generated. It is also possible to generate all forms without any negative prefix by appending + after the lemma (for example able+).
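The lemma notation described above can be illustrated with a small Python parser. This is a sketch based only on the description in this section; the function name and the mode labels (`given_prefix`, `negative_forms`, `no_negative_prefix`, `plain`) are hypothetical and are not part of the MorphoDiTa API.

```python
def parse_english_lemma(lemma):
    """Split a lemma into (raw_lemma, negative_prefix, mode).

    The mode labels are invented for this sketch:
      given_prefix        e.g. "able^un": a specific negative prefix
      negative_forms      e.g. "able^":   empty prefix after "^"
      no_negative_prefix  e.g. "able+":   forms without any negative prefix
      plain               e.g. "able":    no marker present
    """
    if lemma.endswith("+") and len(lemma) > 1:
        return lemma[:-1], None, "no_negative_prefix"
    if "^" in lemma[1:]:
        # The raw lemma is always nonempty, so search from index 1.
        cut = lemma.index("^", 1)
        raw, prefix = lemma[:cut], lemma[cut + 1:]
        return raw, prefix, "given_prefix" if prefix else "negative_forms"
    return lemma, None, "plain"


for example in ["able^un", "able^", "able+", "able"]:
    print(example, parse_english_lemma(example))
```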
4.4. English Model
The English model contains the following files:
english-morphium-<version>.dict
- Morphological dictionary. The SCOWL word list has been automatically analyzed and lemmatized and is used as the dictionary. The guesser performing the analysis and lemmatization is available.
english-morphium-wsj-<version>.tagger
- Tagger trained on the training portion of Wall Street Journal (Sections 0-18) and tuned on the development portion (Sections 19-21). Contains the english-morphium-<version>.dict morphological dictionary. The latest version english-morphium-wsj-140407.tagger reaches 97.27% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.
4.5. No Negations Variant
Stripping of negative prefixes (or handling the lemmas with negative prefixes stripped) may not be desirable. Therefore, a variant of the English model denoted by no_negation is provided, which does not strip negative prefixes from lemmas.
english-morphium-<version>-no_negation.dict
- Morphological dictionary which does not strip negative lemma prefixes. The SCOWL word list has been automatically analyzed and lemmatized and is used as the dictionary. The guesser performing the analysis and lemmatization is available.
english-morphium-wsj-<version>-no_negation.tagger
- Tagger which does not strip negative lemma prefixes, trained on the training portion of Wall Street Journal (Sections 0-18) and tuned on the development portion (Sections 19-21). Contains the english-morphium-<version>-no_negation.dict morphological dictionary. The latest version english-morphium-wsj-140407-no_negation.tagger reaches 97.25% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.
4.6. English Model Changes
english-morphium-140407 and english-morphium-wsj-140407 (require MorphoDiTa 1.1 or later)
- Recognize also "non-" as a negative prefix. Formerly, only "non" was recognized.
english-morphium-140304 and english-morphium-wsj-140304 (require MorphoDiTa 1.0 or later)
- Initial release.
5. Running the Tagger
Probably the most common usage of MorphoDiTa is running a tagger to tag your data using
run_tagger tagger_model
The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or it can be a plain text which is tokenized and segmented automatically.
Any number of files can be specified after the tagger_model. If an argument input_file:output_file is used, the given input_file is processed and the result is saved to output_file. If only input_file is used, the result is saved to standard output. If no argument is given, input is read from standard input and written to standard output.
The full command syntax of run_tagger is
run_tagger [options] tagger_file [file[:output_file]]...
Options: --input=untokenized|vertical
         --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --derivation=none|root|path|tree
         --guesser=0|1 (should morphological guesser be used)
         --output=vertical|xml
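When run_tagger is driven from a script, it can help to assemble and validate the argument list first. The following Python sketch builds such a list from the syntax shown above. The option names and values are the ones listed in this section; the helper name `build_run_tagger_args` and the file names `in.txt` and `out.xml` are hypothetical.

```python
# Option values copied from the command syntax in this section.
ALLOWED_OPTIONS = {
    "input": ("untokenized", "vertical"),
    "convert_tagset": ("pdt_to_conll2009", "strip_lemma_comment",
                       "strip_lemma_id"),
    "derivation": ("none", "root", "path", "tree"),
    "guesser": ("0", "1"),
    "output": ("vertical", "xml"),
}


def build_run_tagger_args(tagger_file, files=(), **options):
    """Return an argv-style list for run_tagger.

    Each entry of `files` is either an input file name or an
    (input_file, output_file) pair.
    """
    args = ["run_tagger"]
    for name, value in options.items():
        value = str(value)
        if value not in ALLOWED_OPTIONS.get(name, ()):
            raise ValueError(f"unsupported option --{name}={value}")
        args.append(f"--{name}={value}")
    args.append(tagger_file)
    for entry in files:
        if isinstance(entry, tuple):
            args.append(f"{entry[0]}:{entry[1]}")
        else:
            args.append(entry)
    return args


print(build_run_tagger_args(
    "czech-morfflex-pdt-161115.tagger",
    files=[("in.txt", "out.xml")],
    input="untokenized",
    output="xml",
))
```

The resulting list could be handed to `subprocess.run`, assuming the run_tagger binary is available on the system path.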
5.1. Input Formats
The input format is specified using the --input option. Currently supported input formats are:
- untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,
- vertical: the input is in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.
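The vertical input format is simple enough to read and write by hand. The Python sketch below converts between a list of sentences and vertical text. The function names are hypothetical, and skipping empty sentences when several empty lines follow each other is an assumption of this sketch rather than documented behaviour.

```python
def read_vertical(text):
    """Parse vertical text into a list of sentences (lists of words)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line == "":
            if current:  # an empty line closes the current sentence
                sentences.append(current)
                current = []
        else:
            current.append(line)
    if current:  # tolerate a missing final empty line
        sentences.append(current)
    return sentences


def write_vertical(sentences):
    """Serialize sentences as one word per line, empty line after each."""
    return "".join("\n".join(words) + "\n\n" for words in sentences)


sample = write_vertical([["Děti", "pojedou", "k", "babičce", "."],
                         ["Už", "se", "těší", "."]])
print(read_vertical(sample))
```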
5.2. Tag Set Conversion
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
- pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
- strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
- strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
5.3. Morphological Derivation
If the morphological model includes a morphological derivator, some morphological derivation operation may be performed on lemmas:
- none (default): no morphological derivation is performed
- root: lemma is replaced by its root in the morphological derivation tree
- path: lemma is replaced by a space separated path to its root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one)
- tree: whole morphological derivation tree is appended after the lemma, encoded in the following way: root node is the first, then the subtrees of the root children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete)
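The tree encoding described above can be made concrete with a small encoder and decoder. The Python sketch below follows one reading of that description: a node is written as its name, then each child subtree preceded by one space, then one closing space. It assumes node names contain no spaces and handles only the encoded tree itself, not how it is attached to the lemma. The node names `root`, `a`, `b`, `c` are placeholders, not real lemmas.

```python
def encode_tree(node):
    """Encode (name, children) as name, ' ' + child subtrees, final ' '."""
    name, children = node
    return name + "".join(" " + encode_tree(c) for c in children) + " "


def _parse(text, pos):
    start = pos
    while pos < len(text) and text[pos] != " ":
        pos += 1
    name, children = text[start:pos], []
    while True:
        if pos >= len(text):
            raise ValueError("missing closing space")
        pos += 1  # consume one space
        if pos < len(text) and text[pos] != " ":
            # The space introduced a child subtree.
            child, pos = _parse(text, pos)
            children.append(child)
        else:
            # The space closed this node's list of children.
            return (name, children), pos


def decode_tree(text):
    """Decode a string produced by encode_tree back into nested tuples."""
    node, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return node


tree = ("root", [("a", [("c", [])]), ("b", [])])
encoded = encode_tree(tree)
print(repr(encoded))
print(decode_tree(encoded) == tree)
```

The path format needs no such parser: under the same no-spaces assumption, splitting the value on spaces yields the original lemma first and the root last.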
5.4. Morphological Guesser
By default, every tagger model uses the morphological guesser settings employed during the model training. However, the usage of the morphological guesser can be overridden by the --guesser option.
5.5. Output Formats
The output format is specified using the --output option. Currently
supported output formats are:
- xml (default): Simple XML format without a root element, using <sentence> element to mark sentences and <token lemma="..." tag="...">...</token> element to encode token and its assigned lemma and tag. Example output for input "Děti pojedou k babičce. Už se těší." (line breaks added):
<sentence><token lemma='dítě' tag='NNFP1-----A----'>Děti</token>
<token lemma='jet-1_^(pohybovat_se,_ne_však_chůzí)' tag='VB-P---3F-AA---'>pojedou</token>
<token lemma='k-1' tag='RR--3----------'>k</token>
<token lemma='babička' tag='NNFS3-----A----'>babičce</token>
<token lemma='.' tag='Z:-------------'>.</token></sentence>
<sentence><token lemma='už-1' tag='Db-------------'>Už</token>
<token lemma='se_^(zvr._zájmeno/částice)' tag='P7-X4----------'>se</token>
<token lemma='těšit_:T' tag='VB-S---3P-AA---'>těší</token>
<token lemma='.' tag='Z:-------------'>.</token></sentence>
- vertical: Every output line is a tab separated triple form-lemma-tag, with empty line denoting end of sentence. Example output for input "Děti pojedou k babičce. Už se těší.":
Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
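Because the vertical format is line oriented, it is straightforward to post-process. The following Python sketch reads tagger output in vertical format back into sentences; the function name parse_vertical is an illustrative choice and not part of MorphoDiTa.

```python
def parse_vertical(text):
    """Split vertical tagger output into sentences of (form, lemma, tag) triples."""
    sentences, current = [], []
    for line in text.split("\n"):
        if line == "":
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, lemma, tag = line.split("\t")   # raises ValueError on malformed lines
        current.append((form, lemma, tag))
    if current:
        sentences.append(current)
    return sentences


sample = "Už\tuž-1\tDb-------------\n.\t.\tZ:-------------\n\n"
for sentence in parse_vertical(sample):
    print([form for form, _, _ in sentence])
```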
6. Running the Morphology
There are multiple commands performing morphological tasks.
The run_morpho_analyze executable performs morphological analysis and
the run_morpho_generate executable performs morphological generation.
The output of these commands is suitable for automatic processing.
The run_morpho_cli executable performs both morphological analysis and generation,
but is designed to be used interactively and produces more human-readable output.
6.1. Morphological Analysis
The morphological analysis can be performed by running
run_morpho_analyze morphology_model use_guesser
The input is assumed to be in UTF-8 encoding and can be either already
tokenized and segmented, or it can be plain text which is tokenized and
segmented automatically. The input files are specified in the same way as for
the run_tagger command.
Some morphological models contain both a manually created dictionary and
a guesser. Therefore, a numeric use_guesser argument is required.
If non-zero, the guesser is used, otherwise not.
Because tagger models contain an embedded morphological model, a tagger model
can be used instead of a morphological one if the --from_tagger option is
specified.
The full command syntax of run_morpho_analyze is
run_morpho_analyze [options] morphology_model use_guesser [file[:output_file]]...
Options: --input=untokenized|vertical
         --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --derivation=none|root|path|tree
         --output=vertical|xml
         --from_tagger
6.1.1. Input Formats
The input format is specified using the --input option. Currently supported
input formats are:
- untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,
- vertical: the input is in vertical format, every line is considered a word, with empty line denoting end of sentence.
Note that the input data is also segmented, even if it is not strictly necessary. Therefore, the input is processed by whole paragraphs (ending by an empty line).
6.1.2. Tag Set Conversion
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
- pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
- strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
- strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
6.1.3. Morphological Derivation
If the morphological model includes a morphological derivator, some morphological derivation operation may be performed on lemmas:
- none (default): no morphological derivation is performed,
- root: lemma is replaced by its root in the morphological derivation tree,
- path: lemma is replaced by a space separated path to its root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one),
- tree: whole morphological derivation tree is appended after the lemma, encoded in the following way: root node is the first, then the subtrees of the root children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete).
6.1.4. Output Formats
The output format is specified using the --output option. Currently
supported output formats are:
- xml (default): Simple XML format without a root element, using <token><analysis lemma="..." tag="..."/><analysis...>...</token> element to encode morphological analysis. Example output for input "Děti pojedou k babičce. Už se těší." (line breaks added):
<sentence><token><analysis lemma="dítě" tag="NNFP1-----A----"/><analysis lemma="dítě" tag="NNFP4-----A----"/><analysis lemma="dítě" tag="NNFP5-----A----"/>Děti</token>
<token><analysis lemma="jet-1_^(pohybovat_se,_ne_však_chůzí)" tag="VB-P---3F-AA---"/>pojedou</token>
<token><analysis lemma="k-1" tag="RR--3----------"/><analysis lemma="k-3_^(označení_pomocí_písmene)" tag="NNNXX-----A----"/><analysis lemma="k-4`kůň_:B_^(jednotka_výkonu)" tag="NNMXX-----A---8"/><analysis lemma="k-8_:B_^(ost._zkratka)" tag="XX------------8"/><analysis lemma="komanditní_:B_^(jen_komanditní_společnost)" tag="AAXXX----1A---8"/><analysis lemma="koncernový_:B" tag="AAXXX----1A---8"/><analysis lemma="kuo-1_:B_,t_^(stará_jednotka_výkonu)" tag="NNNXX-----A---8"/>k</token>
<token><analysis lemma="babička" tag="NNFS3-----A----"/><analysis lemma="babička" tag="NNFS6-----A----"/>babičce</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>
<sentence><token><analysis lemma="už-1" tag="Db-------------"/><analysis lemma="už-2" tag="TT-------------"/>Už</token>
<token><analysis lemma="se_^(zvr._zájmeno/částice)" tag="P7-X4----------"/><analysis lemma="s-1" tag="RV--2----------"/><analysis lemma="s-1" tag="RV--7----------"/>se</token>
<token><analysis lemma="těšit_:T" tag="VB-P---3P-AA---"/><analysis lemma="těšit_:T" tag="VB-S---3P-AA---"/>těší</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>
- vertical: Every output line contains a word followed by tab separated lemma-tag pairs assigned to the input word, with empty line denoting end of sentence. Example output for input "Děti pojedou k babičce. Už se těší.":
Děti	dítě	NNFP1-----A----	dítě	NNFP4-----A----	dítě	NNFP5-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------	k-3_^(označení_pomocí_písmene)	NNNXX-----A----	k-4`kůň_:B_^(jednotka_výkonu)	NNMXX-----A---8	k-8_:B_^(ost._zkratka)	XX------------8	komanditní_:B_^(jen_komanditní_společnost)	AAXXX----1A---8	koncernový_:B	AAXXX----1A---8	kuo-1_:B_,t_^(stará_jednotka_výkonu)	NNNXX-----A---8
babičce	babička	NNFS3-----A----	babička	NNFS6-----A----
.	.	Z:-------------

Už	už-1	Db-------------	už-2	TT-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------	s-1	RV--2----------	s-1	RV--7----------
těší	těšit_:T	VB-P---3P-AA---	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
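Since the xml output has no root element, a standard XML parser will reject it as it stands. A minimal Python sketch of one workaround is to wrap the output in an artificial root before parsing. The function name and the root element name are assumptions made for illustration, and the sketch assumes that special characters in lemmas and forms are properly escaped in the output.

```python
import xml.etree.ElementTree as ET


def parse_analysis_xml(output):
    """Return sentences as lists of (form, [(lemma, tag), ...])."""
    # The output has no root element, so wrap it to obtain well-formed XML.
    root = ET.fromstring("<root>" + output + "</root>")
    sentences = []
    for sentence in root.iter("sentence"):
        tokens = []
        for token in sentence.iter("token"):
            analyses = [(a.get("lemma"), a.get("tag")) for a in token.iter("analysis")]
            # The word form is the text that follows the <analysis/> elements.
            form = "".join(token.itertext())
            tokens.append((form, analyses))
        sentences.append(tokens)
    return sentences


sample = ('<sentence><token><analysis lemma="už-1" tag="Db-------------"/>'
          '<analysis lemma="už-2" tag="TT-------------"/>Už</token> '
          '<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>')
for form, analyses in parse_analysis_xml(sample)[0]:
    print(form, analyses)
```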
6.2. Morphological Generation
The morphological generation can be performed by running
run_morpho_generate morphology_model use_guesser
The input is assumed to be in UTF-8 encoding. The input files are specified
in the same way as for the run_tagger command.
Input for morphological generation has to be in vertical format, each line containing a lemma, which can be optionally followed by a tab and a tag wildcard. The output has the same number of lines as the input; line l contains tab separated form-lemma-tag triplets which can be generated from the lemma on the input line l. If a tag wildcard was provided, only triplets with matching tags are returned.
Some morphological models contain both a manually created dictionary and
a guesser. Therefore, a numeric use_guesser argument is required.
If non-zero, the guesser is used, otherwise not.
Because tagger models contain an embedded morphological model, a tagger model
can be used instead of a morphological one if the --from_tagger option is
specified.
The full command syntax of run_morpho_generate is
run_morpho_generate [options] morphology_model use_guesser [input_file[:output_file]]...
Options: --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --from_tagger
Example input data:
dítě
jet	?[fN]??[-1]
k-1
babička	NNFS3-----A----
Example output:
dítě	dítě	NNNS1-----A----	dítě	dítě	NNNS4-----A----	dítě	dítě	NNNS5-----A----	dítěte	dítě	NNNS2-----A----	dítěti	dítě	NNNS3-----A----	dítěti	dítě	NNNS6-----A----	dítětem	dítě	NNNS7-----A----	děti	dítě	NNFP1-----A----	děti	dítě	NNFP4-----A----	děti	dítě	NNFP5-----A----	dětma	dítě	NNFP7-----A---6	dětmi	dítě	NNFP7-----A----	dětem	dítě	NNFP3-----A----	dětí	dítě	NNFP2-----A----	dětech	dítě	NNFP6-----A----	dětima	dítě_,h	NNFP7-----A---6
ject	jet	Vf--------A---6	jet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A----	jeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A---2	nejet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N----	nejeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N---2	jet	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIS1-----A----	jety	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIP1-----A----
k	k-1	RR--3----------	ke	k-1	RV--3----------	ku	k-1	RV--3---------1
babičce	babička	NNFS3-----A----
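The input and output conventions of run_morpho_generate can be handled with a few lines of Python. The sketch below builds the vertical input from (lemma, wildcard) requests and splits one output line back into form-lemma-tag triplets. Both function names are illustrative, and treating an empty output line as "no forms generated" is an assumption.

```python
def format_generation_input(requests):
    """Build vertical input from (lemma, wildcard-or-None) pairs."""
    lines = []
    for lemma, wildcard in requests:
        # The wildcard is optional and separated from the lemma by a tab.
        lines.append(lemma if wildcard is None else lemma + "\t" + wildcard)
    return "".join(line + "\n" for line in lines)


def parse_generation_line(line):
    """Split one output line into a list of (form, lemma, tag) triplets."""
    fields = line.split("\t") if line else []
    if len(fields) % 3 != 0:
        raise ValueError("expected tab separated form-lemma-tag triplets")
    return [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]


print(format_generation_input([("dítě", None), ("jet", "?[fN]??[-1]")]), end="")
print(parse_generation_line("k\tk-1\tRR--3----------\tke\tk-1\tRV--3----------"))
```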
6.2.1. Tag Set Conversion
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
- pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
- strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
- strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
Note that the tag set conversion is applied only to the output, not to the input lemmas and wildcards.
6.2.2. Tag Wildcards
When only forms with a specific tag should be generated for a given lemma, a tag wildcard can be specified. The tag wildcard is a simple pattern that filters the results of morphological generation.
Most characters of a tag wildcard match corresponding characters of a tag, with the following exceptions:
- ? matches any character of a tag.
- [chars] matches any of the characters listed. The dash - has no special meaning and if ] is the first character in chars, it is considered as one of the characters and does not end the group.
- [^chars] matches any of the characters not listed.
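The wildcard rules above can be sketched in Python as follows. This is an illustration of the described semantics, not MorphoDiTa's own implementation. Two points are assumptions: a wildcard shorter than the tag is treated as matching a prefix of the tag (suggested by the example in section 6.2, where the five-position wildcard ?[fN]??[-1] selects 15-character tags), and the rule about a leading ] is also applied after ^.

```python
def parse_wildcard(wildcard):
    """Compile a tag wildcard into per-position tests (negated, chars); chars=None means any."""
    tests, i = [], 0
    while i < len(wildcard):
        c = wildcard[i]
        if c == "?":
            tests.append((False, None))
            i += 1
        elif c == "[":
            i += 1
            negated = i < len(wildcard) and wildcard[i] == "^"
            if negated:
                i += 1
            start = i
            if i < len(wildcard) and wildcard[i] == "]":
                i += 1                      # a leading ']' is an ordinary member
            while i < len(wildcard) and wildcard[i] != "]":
                i += 1                      # '-' has no special meaning here
            if i >= len(wildcard):
                raise ValueError("unterminated character group")
            tests.append((negated, wildcard[start:i]))
            i += 1
        else:
            tests.append((False, c))        # literal character
            i += 1
    return tests


def wildcard_matches(wildcard, tag):
    """True if every wildcard position accepts the corresponding tag character."""
    tests = parse_wildcard(wildcard)
    if len(tag) < len(tests):
        return False
    for (negated, chars), ch in zip(tests, tag):
        if chars is not None and (ch in chars) == negated:
            return False
    return True


for tag in ("Vf--------A----", "NNIS1-----A----", "NNFP4-----A----"):
    print(tag, wildcard_matches("?[fN]??[-1]", tag))
```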
6.3. Interactive Morphological Analysis and Generation
Interactive morphological analysis and generation with more human-readable output can be run using:
run_morpho_cli morphology_model
The input is read from standard input, one command per line. If there is no tab on a line, analysis is performed on the given word. If there is a tab on a line, generation is performed on the first word, using the second word as a tag wildcard. If the second word is empty (i.e., the input is for example the word on followed by a tab), all forms are generated.
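The line-level rules for run_morpho_cli input can be summarised in a small Python sketch. It only classifies an input line according to the description above and does not call MorphoDiTa; the function name and the returned tuples are illustrative assumptions.

```python
def interpret_cli_line(line):
    """Classify one input line as an analysis or a generation request."""
    line = line.rstrip("\n")
    if "\t" not in line:
        return ("analyze", line)            # no tab: analyse the word
    lemma, wildcard = line.split("\t", 1)
    # An empty second word means that all forms are generated.
    return ("generate", lemma, wildcard or None)


print(interpret_cli_line("děti"))
print(interpret_cli_line("on\t"))
```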
Because tagger models contain an embedded morphological model, a tagger model
can be used instead of a morphological one if the --from_tagger option is
specified.
The full command syntax of run_morpho_cli is
run_morpho_cli [options] morphology_model
Options: --from_tagger
7. Running the Tokenizer
Using the run_tokenizer executable it is possible to perform only
tokenization and segmentation.
The input is a UTF-8 encoded plain text and the input files are specified in
the same way as for the run_tagger command.
The tokenizer can be specified either by using a morphology model
(--morphology option), a tagger model (--tagger option) or by using
a tokenizer identifier (--tokenizer option). Currently supported
tokenizer identifiers are:
- czech
- english
- generic
The full command syntax of run_tokenizer is
run_tokenizer [options] [file[:output_file]]...
Options: --tokenizer=czech|english|generic
         --morphology=morphology_model_file
         --tagger=tagger_model_file
         --output=vertical|xml
7.1. Output Formats
The output format is specified using the --output option. Currently
supported output formats are:
- xml (default): Simple XML format without a root element, using <sentence> element to mark sentences and <token> element to mark tokens. Example output for input "Děti pojedou k babičce. Už se těší." (line breaks added):
<sentence><token>Děti</token> <token>pojedou</token> <token>k</token> <token>babičce</token><token>.</token></sentence>
<sentence><token>Už</token> <token>se</token> <token>těší</token><token>.</token></sentence>
- vertical: Each token is on a separate line, every sentence is ended by a blank line. Example output for input "Děti pojedou k babičce. Už se těší.":
Děti
pojedou
k
babičce
.

Už
se
těší
.
8. Running REST Server
MorphoDiTa also provides a REST server binary, morphodita_server.
The binary uses MicroRestD as a REST server implementation and provides
MorphoDiTa REST API.
The full command syntax of morphodita_server is
morphodita_server [options] port (model_name model_file acknowledgements)*
Options: --connection_timeout=maximum connection timeout [s] (default 60)
         --daemon (daemonize after start, supported on Linux only)
         --log_file=file path (no logging if empty, default morphodita_server.log)
         --log_request_max_size=max req log size [kB] (0 unlimited, default 64)
         --max_connections=maximum network connections (default 256)
         --max_request_size=maximum request size [kB] (default 1024)
         --threads=threads to use (default 0 means unlimited)
The morphodita_server can run either in foreground or in background (when
--daemon is used). The specified model files are loaded during start and
kept in memory all the time. This behaviour might change in future to load the
models on demand.
9. Custom Morphological and Tagging Models
It is possible to create custom morphological and tagging models.
9.1. Custom Morphological Models
Custom morphological models can be created using the encode_dictionary binary.
The encode_dictionary binary reads from standard input and prints a MorphoDiTa
morphological model on standard output. The input of encode_dictionary is
a textual representation of a morphological dictionary. It should be UTF-8
encoded and every line should be a tab separated triplet
lemma \t tag \t form. All forms of one lemma must appear in a continuous region
and no line should appear more than once (sort -u can be used to achieve this).
Run encode_dictionary with the following options:
encode_dictionary generic max_suffix_len unknown_tag number_tag punctuation_tag symbol_tag [statistical_guesser]
- generic: This parameter defines tokenizer and other language specific behaviour. Other values than generic take different options and are not documented.
- max_suffix_len: Maximum length of suffixes in automatically inferred inflexion classes. If unsure, use 8 (we use 8 for Czech and 4 for English). Smaller values produce larger and slightly faster models.
- unknown_tag: Assigned to a form during analysis if no matching tag can be found.
- number_tag: Assigned to a form during analysis if the form was not found in the dictionary and it looks like a number. Can be the same as unknown_tag.
- punctuation_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Punctuation category. Can be the same as unknown_tag.
- symbol_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Symbol category. Can be the same as unknown_tag.
- statistical_guesser: Optional file containing statistical guesser generated using the train_guesser binary (see below).
Example input data:
dog	NN	dog
dog	NNS	dogs
go	VB	go
go	VBP	go
go	VBZ	goes
go	VBG	going
go	VBD	went
Example command line:
encode_dictionary generic 8 UNK NUM PUNC SYM <input_data >output_model
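The two constraints on the dictionary input (all forms of one lemma in a continuous region, no repeated lines) can also be enforced programmatically. The Python sketch below is one possible helper with an illustrative name; it sorts the unique (lemma, tag, form) triplets, which places each lemma in one continuous region, on the assumption that the order of lines within a lemma does not matter.

```python
def prepare_dictionary(entries):
    """Turn (lemma, tag, form) triplets into input text for encode_dictionary."""
    # Deduplicate, then sort: tuples sort by lemma first, so each lemma is contiguous.
    unique = sorted(set(entries))
    for triplet in unique:
        for field in triplet:
            if not field or "\t" in field or "\n" in field:
                raise ValueError(f"invalid field {field!r}")
    return "".join(f"{lemma}\t{tag}\t{form}\n" for lemma, tag, form in unique)


entries = [("go", "VB", "go"), ("dog", "NNS", "dogs"), ("dog", "NN", "dog"), ("go", "VB", "go")]
print(prepare_dictionary(entries), end="")
```

The resulting text can be written to a UTF-8 file and passed to encode_dictionary on standard input.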
9.1.1. Training Statistical Guesser
Optionally, a statistical guesser might be trained on disambiguated
data using the train_guesser binary.
The input data is in the same format as the training data for the tagger,
i.e., every word on a line (each line containing tab separated triplet
form \t lemma \t tag in UTF-8 encoding), with end of sentence denoted
by an empty line. Note that the input data must not contain spaces.
The full command syntax of train_guesser is:
train_guesser [options] suffix_len rules_per_suffix <input_data >output_guesser
Options: --max_prefixes=maximum number of prefixes to create
         --min_prefix_count=minimum count to create a prefix
- suffix_len: Generate guesser rules using suffixes of length at most suffix_len (for Czech we use 3).
- rules_per_suffix: Maximum number of guesser rules generated per suffix (for Czech we use 8 for a rich tag set (more than a thousand tags) and 6 for a coarse tag set (67 tags)).
- max_prefixes: The guesser rules might also be specific for several prefixes. There might be at most max_prefixes such prefixes. Note that the more prefixes are allowed, the larger the guesser is (for Czech we set this to 0, but for some other languages we also use 4).
- min_prefix_count: In order for a prefix to be considered, it has to occur at least the specified number of times in the data.
9.1.2. Using External Morphology
Sometimes it is useful to train MorphoDiTa tagger using external morphological analysis, without having a MorphoDiTa morphological dictionary.
That is possible using a so-called external morphology model. An external morphology model can be created easily using
encode_dictionary external unknown_tag >output_model
No standard input is read in this case. The unknown_tag parameter is used when
no tag is assigned to a word form during analysis. The resulting model is
printed on standard output.
The external morphology model does not contain any morphological dictionary.
Instead, it expects the user to perform morphological analysis and generation on
their own. Therefore, the input form to analysis is expected to be followed by
space separated lemma-tag pairs, which are returned by the analysis.
Similarly, the input lemma to generation is expected to be followed by space
separated form-tag pairs, which are again returned by the generation (possibly
filtered by a tag wildcard). (To extract the length of the form or lemma itself
even when followed by external analyses, API calls raw_form_len or
raw_lemma_len and lemma_id_len can be used.)
Note that the tokenizer returned by the external morphology model is the same as the tokenizer of the generic model, and splits input on spaces. Therefore, it can be used to tokenize input, the tokens can then be passed to the external morphology, and the results can, after proper formatting, be used as input to MorphoDiTa in vertical input format.
Example input form for analysis using external morphology model:
wishes wish NNS wish VBZ
Example input lemma for generation using external morphology model:
go go VB go VBP goes VBZ going VBG went VBD
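The "proper formatting" for an external morphology model amounts to joining the form (or lemma) and its externally computed pairs with single spaces. A minimal Python sketch follows; the function names are illustrative, and the same layout is assumed to serve both analysis (form plus lemma-tag pairs) and generation (lemma plus form-tag pairs), as described above.

```python
def format_external(head, pairs):
    """Join a form (or lemma) with its space separated pairs."""
    parts = [head]
    for first, second in pairs:
        parts += [first, second]
    for part in parts:
        # Spaces separate the fields, so no field may contain whitespace.
        if not part or any(ch in part for ch in " \t\n"):
            raise ValueError(f"invalid field {part!r}")
    return " ".join(parts)


def parse_external(text):
    """Split the formatted text back into (head, pairs)."""
    fields = text.split(" ")
    if len(fields) % 2 != 1:
        raise ValueError("expected a head followed by pairs")
    return fields[0], list(zip(fields[1::2], fields[2::2]))


print(format_external("wishes", [("wish", "NNS"), ("wish", "VBZ")]))
```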
9.2. Custom Tagging Models
Custom tagging models can be trained using the train_tagger binary, which has
the following options:
train_tagger generic_234 morphology use_guesser features iterations prune_features [heldout_data [early_stopping]] <input_data >tagger_model
- generic_234: This parameter defines the tagger (elementary features and algorithm) and the order of Viterbi decoding. Use either generic2, generic3 or generic4. If unsure, use generic3 (best released Czech and English models use generic3). The generic2 produces faster, but less accurate models, generic4 produces larger and only marginally better models.
- morphology: File with the morphological dictionary to use.
- use_guesser: Use 0/1 to specify whether morphological guesser should be used. Unless you have a good reason not to, use 1.
- features: File with feature sequences for the tagger. The file format and available elementary features are described in the Feature File Format section below.
- iterations: Number of training iterations. For English, values 5-10 are used, for Czech, values 10-15 are used. Can be affected by early_stopping.
- prune_features: Use 0/1 to disable/enable pruning of feature sequences not found in training data. Use 1 for smaller and marginally less accurate models, and 0 for larger and marginally better models. If unsure, use 1 (best released Czech and English models use 1).
- heldout_data: Optional file with heldout data in the same format as input data. If supplied, accuracy is measured on the heldout data after every training iteration.
- early_stopping: Optionally use 0/1 to disable/enable early stopping. If early stopping is enabled, the resulting model is not the one after the last training iteration, but the one with best heldout data accuracy.
Example command line (use morphology from morpho.dict, features from features.ft
and no heldout data):
train_tagger generic3 morpho.dict 1 features.ft 10 1 <input.data >tagger.model
Example command line (use morphology from morpho.dict, features from features.ft
and use heldout data with early stopping):
train_tagger generic3 morpho.dict 1 features.ft 15 1 heldout.data 1 <input.data >tagger.model
See the following sections for examples of input data and feature files.
9.2.1. Input Data Format
The input data (and the heldout data) represent a sequence of sentences.
Different sentences do not interact in any way. Words of one sentence are
stored on consecutive lines, each line containing tab separated triplet
form \t lemma \t tag in UTF-8 encoding. End of sentence is denoted
by an empty line.
Example:
Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
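Training data in this format can be produced from in-memory sentences with a short helper. The Python sketch below is illustrative only (the function name and the forbid_spaces flag are assumptions); the flag reflects the note in section 9.1.1 that data used for train_guesser must not contain spaces.

```python
def format_training_data(sentences, forbid_spaces=False):
    """Serialise sentences of (form, lemma, tag) triplets into the vertical training format."""
    banned = "\t\n" + (" " if forbid_spaces else "")
    lines = []
    for sentence in sentences:
        for form, lemma, tag in sentence:
            for field in (form, lemma, tag):
                if not field or any(ch in field for ch in banned):
                    raise ValueError(f"invalid field {field!r}")
            lines.append("\t".join((form, lemma, tag)))
        lines.append("")                    # an empty line ends the sentence
    return "".join(line + "\n" for line in lines)


data = [[("Už", "už-1", "Db-------------"), (".", ".", "Z:-------------")]]
print(format_training_data(data), end="")
```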
9.2.2. Feature File Format
The features used in the tagger have major influence on tagging performance.
The feature file contains several feature sequences, each sequence
consisting of several elementary features. The elementary features are
computed by MorphoDiTa and different tagger models can have a different set of
elementary features. Here we describe elementary features of the generic
tagger:
- Form: word form
- Prefix1..Prefix9: word form prefix of length 1..9 (measured in Unicode characters)
- Suffix1..Suffix9: word form suffix of length 1..9 (measured in Unicode characters)
- Num: whether the word form contains at least one number (Unicode category Number)
- Cap: whether the word form contains at least one uppercase or titlecase letter
- Dash: whether the word form contains at least one dash (Unicode category 'Punctuation, Dash')
- Tag: word form PoS tag
- Tag1..Tag5: letter 1..5 of word form PoS tag
- Lemma: word form lemma
- FollowingVerbTag: PoS tag of a nearest following verb, i.e., a nearest following word form with at least one of the PoS tags starting with V
- FollowingVerbLemma: lemma of a nearest following verb, i.e., a nearest following word form with at least one of the PoS tags starting with V
- PreviousVerbTag: PoS tag of a nearest previous verb, i.e., a nearest previous word whose PoS tag (assigned by the tagger) starts with V
- PreviousVerbLemma: lemma of a nearest previous verb, i.e., a nearest previous word whose PoS tag (assigned by the tagger) starts with V
The feature file defines feature sequences which can be applied to a word form. A feature sequence consists of elementary features assigned to the given form or its neighbours.
Every line in the feature file defines one feature sequence. A feature sequence
consists of comma joined space separated pairs of an elementary feature and the
offset to which the elementary feature applies (e.g., Form 0 or
Tag 0,Lemma -1). The file format is strict and does not allow any
additional spaces or commas.
Note that the offset of some of the elementary features is affected by the order of
Viterbi decoding used. Notably, if Viterbi decoding of order N is utilized,
Tag and Lemma can be used inside the decoded window, i.e., only with
offsets -N+1 .. 0.
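Because the feature file format is strict, it can help to check a file before training. The following Python sketch validates one feature sequence against the rules stated above: comma joined "Name offset" pairs, no extra spaces, known elementary feature names, and Tag and Lemma offsets within -N+1 .. 0 for Viterbi order N. It is an illustrative checker, not part of MorphoDiTa; the window rule is applied only to Tag and Lemma, as the text states, and whether it also extends to Tag1..Tag5 is left open.

```python
import re

ELEMENTARY = (
    {"Form", "Num", "Cap", "Dash", "Tag", "Lemma",
     "FollowingVerbTag", "FollowingVerbLemma",
     "PreviousVerbTag", "PreviousVerbLemma"}
    | {f"Prefix{i}" for i in range(1, 10)}
    | {f"Suffix{i}" for i in range(1, 10)}
    | {f"Tag{i}" for i in range(1, 6)}
)
PAIR = re.compile(r"(\w+) (-?\d+)")


def check_feature_sequence(line, order):
    """Return a list of problems found in one feature-file line (empty if none)."""
    problems = []
    for pair in line.split(","):
        match = PAIR.fullmatch(pair)        # rejects extra spaces and empty pairs
        if match is None:
            problems.append(f"malformed pair {pair!r}")
            continue
        name, offset = match.group(1), int(match.group(2))
        if name not in ELEMENTARY:
            problems.append(f"unknown elementary feature {name!r}")
        elif name in ("Tag", "Lemma") and not (-order + 1 <= offset <= 0):
            problems.append(f"{name} offset {offset} outside the window for order {order}")
    return problems


print(check_feature_sequence("Tag 0,Tag -1,Tag -2", 3))
print(check_feature_sequence("Tag 0,Tag -2", 2))
```

Here order is 2, 3 or 4 for the generic2, generic3 and generic4 taggers respectively.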
For inspiration, we present feature files used for the released Czech and English MorphoDiTa models. Both these feature files are slight modifications of feature files described in the paper Spoustová et al. 2009: Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.
Feature file for English:
Tag 0,Form 0
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Prefix5 0
Tag 0,Prefix6 0
Tag 0,Prefix7 0
Tag 0,Prefix8 0
Tag 0,Prefix9 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Suffix5 0
Tag 0,Suffix6 0
Tag 0,Suffix7 0
Tag 0,Suffix8 0
Tag 0,Suffix9 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Form -1
Tag 0,Form -2
Tag 0,Form -1,Form -2
Tag 0,Form 1
Tag 0,Form 1,Form 2
Tag 0,Tag1 -1
Tag 0,Lemma -1
Lemma 0,Tag -1
Feature file for Czech (note that some feature sequences predict only part of
PoS tags trying to overcome data sparseness; Tag2 is extended PoS, Tag3
is gender, Tag5 is case):
Tag 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Tag -2
Tag 0,Form 0
Tag 0,Form 0,Form -1
Tag 0,Form -1
Tag 0,Form -2
Tag 0,PreviousVerbTag 0
Tag 0,PreviousVerbLemma 0
Tag 0,FollowingVerbTag 0
Tag 0,FollowingVerbLemma 0
Tag 0,Lemma -1
Lemma 0,Tag -1
Tag 0,Form 1
Tag2 0,Tag5 0
Tag2 0,Tag5 0,Tag2 -1,Tag5 -1
Tag2 0,Tag5 0,Tag2 -1,Tag5 -1,Tag2 -2,Tag5 -2
Tag5 0
Tag5 0,Tag -1
Tag5 0,Tag -1,Tag -2
Tag5 0,Tag -2
Tag5 0,Form 0
Tag5 0,Form 0,Form -1
Tag5 0,Form -1
Tag5 0,Form -2
Tag5 0,PreviousVerbTag 0
Tag5 0,PreviousVerbLemma 0
Tag5 0,FollowingVerbTag 0
Tag5 0,FollowingVerbLemma 0
Tag5 0,Lemma -1
Tag5 0,Form 1
Tag3 0
Tag3 0,Tag -1
Tag3 0,Tag -1,Tag -2
Tag3 0,Tag -2
Tag3 0,Form 0
Tag3 0,Form 0,Form -1
Tag3 0,Form -1
Tag3 0,Form -2
Tag3 0,PreviousVerbTag 0
Tag3 0,PreviousVerbLemma 0
Tag3 0,FollowingVerbTag 0
Tag3 0,FollowingVerbLemma 0
Tag3 0,Lemma -1
Tag3 0,Form 1
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
Tag5 0,Suffix1 0
Tag5 0,Suffix2 0
Tag5 0,Suffix3 0
Tag5 0,Suffix4 0
Feature file for Czech, Part of Speech only variant:
Tag 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Tag -2
Tag 0,Form 0
Tag 0,Form 0,Form -1
Tag 0,Form -1
Tag 0,Form -2
Tag 0,PreviousVerbTag 0
Tag 0,PreviousVerbLemma 0
Tag 0,FollowingVerbTag 0
Tag 0,FollowingVerbLemma 0
Tag 0,Lemma -1
Lemma 0,Tag -1
Tag 0,Form 1
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
9.2.3. Measuring Tagger Accuracy
Measuring custom tagger accuracy can be performed by running:
tagger_accuracy tagger_model <test_data
This binary reads input in the same format as train_tagger,
i.e., tab separated form-lemma-tag triplets, and evaluates the accuracy
of the tagger model on the given testing data.