Korektor User's Manual
Korektor is a statistical spellchecker and (occasional) grammar checker.
Like any supervised machine learning tool, Korektor needs a trained linguistic model.
We now describe the available language models, and then the command line tools korektor and tokenizer.
1. Czech Korektor Models
Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology used by the model is based on MorfFlex CZ. Czech models work in Korektor version 2 or later.
Czech models are versioned according to the release date in the format YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively. The latest version is 130202.
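The version scheme can be decoded mechanically. The following is a minimal sketch, not part of Korektor: the helper name parse_model_version is made up for illustration, and it relies on Python's convention for two-digit years, under which 13 is read as 2013.

```python
from datetime import date, datetime


def parse_model_version(version: str) -> date:
    """Interpret a YYMMDD model version (e.g. "130202") as a release date.

    parse_model_version is a hypothetical helper, not a Korektor function.
    """
    if len(version) != 6 or not (version.isascii() and version.isdigit()):
        raise ValueError(f"expected six digits (YYMMDD), got {version!r}")
    # %y is a two-digit year; Python maps 00-68 to the years 2000-2068.
    return datetime.strptime(version, "%y%m%d").date()


print(parse_model_version("130202"))  # 2013-02-02
```

Because the format is fixed-width with the year first, versions from the same century also sort correctly as plain strings.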
1.1. Download
The latest version 130202 of the Czech Korektor models can be downloaded from the LINDAT/CLARIN repository.
1.2. Acknowledgements
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The latest Czech models were created by Michal Richter as part of his Master's thesis and are described in (Richter et al. 2012).
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
1.2.1. Publications
- (Richter et al. 2012) Richter Michal, Straňák Pavel and Rosen Alexandr. Korektor – A System for Contextual Spell-checking and Diacritics Completion. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 1-12, Mumbai, India, 2012.
1.3. Czech Model Variants
The Czech model contains the following variants:
korektor-czech-130202/diacritics_h2mor.conf
- Spellchecker model which only _adds_ diacritical marks. Note that the model never removes diacritical marks already present in the input, so you have to strip them manually beforehand if you want them to be ignored.
korektor-czech-130202/spellchecking_h2mor.conf
- Spellchecker model which considers corrections with edit distance at most one. You should use this model for generic spellchecking.
korektor-czech-130202/spellchecking_h2mor_2edits.conf
- Spellchecker model which considers corrections with edit distance at most two. This model can be useful if the required corrections are not found by the spellchecking_h2mor.conf model, but it may be considerably slower.
2. Running the Korektor
The korektor binary is used to run the Korektor. The only required argument is the model configuration which should be used for correcting. The input is read from standard input; it should be in UTF-8 encoding and can be either already tokenized and segmented, segmented only, or plain text. The output is written to standard output in UTF-8 encoding.
The full command syntax of korektor is

    korektor [options] model_configuration
    Options: --input=untokenized|untokenized_lines|segmented|vertical|horizontal
             --output=original|xml|vertical|horizontal
             --corrections=maximum_number_of_corrections
             --viterbi_order=viterbi_decoding_order
             --viterbi_beam_size=maximum_viterbi_beam_size
             --viterbi_stage_pruning=maximum_viterbi_stage_cost_increment
             --context_free
             --version
             --help
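When Korektor is called from a script, the argument list can be assembled programmatically. The sketch below uses only options that appear in the syntax above. The helper names build_korektor_command and run_korektor are made up for illustration, and the code assumes that a korektor binary is on the PATH and that the model configuration path is valid on your machine.

```python
import subprocess


def build_korektor_command(model_configuration, input_format=None,
                           output_format=None, corrections=None,
                           context_free=False, binary="korektor"):
    """Assemble an argument list following the syntax shown above.

    Hypothetical helper; options left as None are simply omitted, so
    Korektor falls back to its own defaults.
    """
    command = [binary]
    if input_format is not None:
        command.append(f"--input={input_format}")
    if output_format is not None:
        command.append(f"--output={output_format}")
    if corrections is not None:
        command.append(f"--corrections={corrections}")
    if context_free:
        command.append("--context_free")
    command.append(model_configuration)  # the only required argument
    return command


def run_korektor(text, model_configuration, **options):
    """Send UTF-8 text to korektor on stdin and return its stdout as str."""
    command = build_korektor_command(model_configuration, **options)
    result = subprocess.run(command, input=text.encode("utf-8"),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")


# The model path is a placeholder; point it at your unpacked model directory.
print(build_korektor_command("spellchecking_h2mor.conf",
                             input_format="untokenized_lines",
                             corrections=3))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the model path contains spaces.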
2.1. Input Formats
The input format is specified using the --input option. Currently supported input formats are:
- untokenized (default): the input is plain text, which is segmented and tokenized automatically. Note that sentences can span multiple lines, but an empty line always terminates a sentence.
- untokenized_lines: very similar to untokenized; the only difference is that sentences cannot span multiple lines, so every newline is a sentence terminator.
- segmented: the input is assumed to be segmented using newlines, but it is tokenized automatically.
- vertical: the input is tokenized and segmented in vertical format; every line is considered a word, with an empty line denoting the end of a sentence.
- horizontal: the input is tokenized and segmented in horizontal format; every line is a sentence, with words separated by spaces.
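If the text has already been tokenized by another tool, it has to be serialized into the vertical or horizontal format before being passed on. A minimal sketch of both serializations follows; the function names are made up for illustration, and the input is assumed to be a list of sentences, each a non-empty list of tokens that contain no spaces, tabs or newlines.

```python
def to_vertical(sentences):
    """One token per line, with an empty line after every sentence."""
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)


def to_horizontal(sentences):
    """One sentence per line, tokens separated by single spaces."""
    return "".join(" ".join(tokens) + "\n" for tokens in sentences)


sentences = [["Hoši", "jely", "k", "babicce", "."]]
print(to_vertical(sentences), end="")
print(to_horizontal(sentences), end="")
```

The horizontal form is more compact, while the vertical form remains unambiguous for tokens that would otherwise need special handling around spaces.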
2.2. Number of Corrections
The maximum number of corrections that Korektor should return for every word is specified using the --corrections option, and defaults to one.
Note that some output formats cannot handle multiple corrections, because they can only replace the original word by a corrected one.
2.3. Output Formats
The output format is specified using the --output option. Currently supported output formats are:
- original (default when number of corrections is 1): the original words are replaced by the corrected ones; all other characters, including spaces, are preserved. Note that this output format cannot handle multiple corrections per word.
- xml (default when number of corrections is greater than 1): the original input is encoded as XML and the suggested corrections are marked using the following XML elements:
  - spelling corrections for a word are marked using the spelling element, with the suggested corrections listed in the suggestions attribute, ordered by correction probability with the most probable one first
  - grammar corrections are marked in the same way as spelling corrections, but the grammar element is used instead of spelling

  For example, for the input

      Hoši jely k babicce.

  the output in xml output format with at most three corrections is

      Hoši <grammar suggestions="jeli jely jel">jely</grammar> k <spelling suggestions="babičce babice babince">babicce</spelling>.

- vertical: each word is printed on a separate line, with an empty line denoting the end of a sentence. If there are any suggested corrections for a word, they are printed on the same line as the original word using several tab separated columns:
  - the first column contains the original word
  - the second column contains either the letter S or G, where S denotes a spelling correction and G denotes a grammar correction
  - the rest of the columns are the suggested corrections, ordered by correction probability with the most probable one first

  For example, for the input

      Hoši jely k babicce.

  the output in vertical output format with at most three corrections, with explicitly marked tab characters, is

      Hoši
      jely<--tab-->G<--tab-->jeli<--tab-->jely<--tab-->jel
      k
      babicce<--tab-->S<--tab-->babičce<--tab-->babice<--tab-->babince
      .

- horizontal: the original words are replaced by the corrected ones. Each sentence is printed on a separate line and all words are space separated. Note that this output format cannot handle multiple corrections per word.
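Both formats that carry multiple suggestions are straightforward to consume from a script. The sketch below is not an official client: the function names are made up, and the XML parser assumes that the output is a bare fragment like the one in the example, so it wraps the text in an invented root element before parsing. If your Korektor build already emits a complete XML document, drop the wrapping.

```python
import xml.etree.ElementTree as ET


def parse_xml_output(fragment):
    """Return (kind, original, suggestions) for every marked word.

    The fragment is wrapped in a made-up <root> element so that it parses
    as one XML document; <root> is an assumption, not Korektor output.
    """
    root = ET.fromstring(f"<root>{fragment}</root>")
    return [(element.tag, element.text, element.get("suggestions", "").split())
            for element in root
            if element.tag in ("spelling", "grammar")]


def parse_vertical_output(text):
    """Return a list of sentences; each token is (word, kind, suggestions).

    kind is "S", "G", or None when the line has no suggestion columns.
    """
    sentences, current = [], []
    for line in text.split("\n"):
        if not line:                  # an empty line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        kind = columns[1] if len(columns) > 1 else None
        current.append((columns[0], kind, columns[2:]))
    if current:                       # input without a trailing empty line
        sentences.append(current)
    return sentences


xml_line = ('Hoši <grammar suggestions="jeli jely jel">jely</grammar> k '
            '<spelling suggestions="babičce babice babince">babicce</spelling>.')
print(parse_xml_output(xml_line))

vertical = ("Hoši\njely\tG\tjeli\tjely\tjel\nk\n"
            "babicce\tS\tbabičce\tbabice\tbabince\n.\n\n")
print(parse_vertical_output(vertical))
```

In both cases the first suggestion in the returned list is the most probable one, because the order of the output is preserved.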
2.4. Context Free Corrections
Context free corrections can be generated by supplying the --context_free option. In that case each word is considered separately and sentence boundaries are ignored. This mode produces much worse results and should be used only when no context is actually available.
2.5. Viterbi Decoding Options
The Viterbi decoding algorithm can be tweaked using the following options:
- --viterbi_order: Use a specific Viterbi decoding order instead of the default one. Use 1 for the fastest execution, but the worst accuracy. Setting this to a value higher than the maximum model order minus one has no effect.
- --viterbi_beam_size: Limit the Viterbi beam size to the specified constant. Use a smaller value for faster execution, but worse accuracy.
- --viterbi_stage_pruning: Limit the maximum cost increment in one Viterbi stage. Use a smaller value for faster execution, but worse accuracy.
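To see why smaller limits trade accuracy for speed, consider a generic first-order Viterbi search with pruning. This toy sketch is not Korektor's decoder. The lattice, the costs and the function name are invented, and it reads "stage pruning" as discarding hypotheses whose cost exceeds the best one in the same stage by more than a threshold, which is only one plausible interpretation of the option. The decoding order is not modelled at all.

```python
def beam_viterbi(stages, transition_cost, beam_size=None, stage_pruning=None):
    """First-order Viterbi search over a lattice, with optional pruning.

    stages: one non-empty list of (word, cost) candidates per position.
    transition_cost(previous_word, word): cost of the word pair;
    previous_word is None at the first position.
    Returns (total_cost, words) of the cheapest surviving path.
    """
    hypotheses = [(0.0, [])]
    for candidates in stages:
        best_by_word = {}
        for cost, path in hypotheses:
            previous = path[-1] if path else None
            for word, word_cost in candidates:
                total = cost + word_cost + transition_cost(previous, word)
                if word not in best_by_word or total < best_by_word[word][0]:
                    best_by_word[word] = (total, path + [word])
        hypotheses = sorted(best_by_word.values(), key=lambda h: h[0])
        if stage_pruning is not None:
            # Assumed meaning: drop states too far behind the stage's best.
            limit = hypotheses[0][0] + stage_pruning
            hypotheses = [h for h in hypotheses if h[0] <= limit]
        if beam_size is not None:
            # Keep only the cheapest beam_size states.
            hypotheses = hypotheses[:beam_size]
    return hypotheses[0]


# Invented toy lattice: "b" looks worse locally but combines well with "c".
stages = [[("a", 1.0), ("b", 2.0)], [("c", 0.0)]]
pair_costs = {("a", "c"): 5.0, ("b", "c"): 0.0}


def toy_transition(previous, word):
    return pair_costs.get((previous, word), 0.0)


print(beam_viterbi(stages, toy_transition))                # (2.0, ['b', 'c'])
print(beam_viterbi(stages, toy_transition, beam_size=1))   # (6.0, ['a', 'c'])
```

With the beam limited to one state, the locally cheaper candidate "a" survives and the globally best path through "b" is lost. Less work is done per stage, but the result is worse.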
3. Running the tokenizer
The tokenizer binary is used to run the tokenizer. The input is read from standard input; it should be in UTF-8 encoding and can be either already tokenized and segmented, segmented only, or plain text. The output is written to standard output in UTF-8 encoding.
The full command syntax of tokenizer is

    tokenizer [options]
    Options: --input=untokenized|untokenized_lines|segmented|vertical|horizontal
             --output=vertical|horizontal
             --version
             --help
3.1. Input Formats
The input format is specified using the --input option. Currently supported input formats are:
- untokenized (default): the input is plain text, which is segmented and tokenized automatically. Note that sentences can span multiple lines, but an empty line always terminates a sentence.
- untokenized_lines: very similar to untokenized; the only difference is that sentences cannot span multiple lines, so every newline is a sentence terminator.
- segmented: the input is assumed to be segmented using newlines, but it is tokenized automatically.
- vertical: the input is tokenized and segmented in vertical format; every line is considered a word, with an empty line denoting the end of a sentence.
- horizontal: the input is tokenized and segmented in horizontal format; every line is a sentence, with words separated by spaces.
3.2. Output Formats
The output format is specified using the --output option. Currently supported output formats are:
- vertical: each word is printed on a separate line, with an empty line denoting the end of a sentence.
- horizontal: each sentence is printed on a separate line and all words are space separated.
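Either output format can be read back into a list of sentences with a few lines of code. The following sketch uses made-up function names, and the sample strings were written by hand in the two formats rather than captured from a real tokenizer run.

```python
def read_vertical(text):
    """Parse vertical output: one token per line, empty line ends a sentence."""
    sentences, current = [], []
    for line in text.split("\n"):
        if line:
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:                       # output without a trailing empty line
        sentences.append(current)
    return sentences


def read_horizontal(text):
    """Parse horizontal output: one sentence per line, tokens split on spaces."""
    return [line.split(" ") for line in text.split("\n") if line]


# Hand-written samples in the two formats (not real tokenizer output).
vertical_sample = "Hoši\njely\nk\nbabicce\n.\n\n"
horizontal_sample = "Hoši jely k babicce .\n"
print(read_vertical(vertical_sample))
print(read_horizontal(horizontal_sample))
```

Both readers return the same structure, so the rest of a processing pipeline does not need to know which output format was requested.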
4. Running the REST Server
The REST server can be run using the korektor_server binary. The binary uses MicroRestD as a REST server implementation and provides the Korektor REST API.
The full command syntax of korektor_server is

    korektor_server [options] port (model_name weblicht_language model_file acknowledgements)*
    Options: --daemon
The korektor_server can run either in the foreground or in the background (when --daemon is used). The specified model files are loaded during startup and kept in memory all the time. This behaviour might change in the future to load the models on demand.
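Since every served model contributes four positional arguments, it can be convenient to generate the command line from a structured description. The sketch below is only an illustration: the names ServedModel and build_server_command are made up, and the port, model name, language value, file path and acknowledgements string in the example are placeholders, not values known to be accepted by the server.

```python
from typing import NamedTuple


class ServedModel(NamedTuple):
    """One (model_name weblicht_language model_file acknowledgements) group."""
    model_name: str
    weblicht_language: str
    model_file: str
    acknowledgements: str


def build_server_command(port, models, daemon=False, binary="korektor_server"):
    """Assemble an argument list following the syntax shown above.

    Hypothetical helper; it only builds the list and does not start anything.
    """
    command = [binary]
    if daemon:
        command.append("--daemon")
    command.append(str(port))
    for model in models:
        command.extend(model)         # four positional arguments per model
    return command


# All values below are placeholders chosen for the example.
example = ServedModel("example-model", "cs", "spellchecking_h2mor.conf",
                      "http://example.com/acknowledgements")
print(build_server_command(8080, [example], daemon=True))
```

The resulting list can be handed to subprocess.run or joined for display. Because the models are kept in memory, each additional group increases the memory footprint of the server.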