Korektor User's Manual

Czech Korektor Models
Running the Korektor
Running the tokenizer
- 3.1. Input Formats
- 3.2. Output Formats
Running the REST Server

Korektor is a statistical spellchecker and (occasional) grammar checker. Like any supervised machine learning tool, Korektor needs a trained linguistic model. We now describe the available language models, and then the command line tools korektor and tokenizer.

1. Czech Korektor Models

Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology used by the model is based on MorfFlex CZ. Czech models work in Korektor version 2 or later.

Czech models are versioned according to the release date in the format YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively. The latest version is 130202.
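The YYMMDD scheme can be turned into a calendar date when model versions need to be compared or sorted. The following is a minimal sketch; the helper name `parse_model_version` is hypothetical and not part of Korektor.

```python
from datetime import date, datetime


def parse_model_version(version: str) -> date:
    """Parse a YYMMDD model version string into a calendar date."""
    if len(version) != 6 or not version.isdigit():
        raise ValueError(f"expected six digits (YYMMDD), got {version!r}")
    # %y is a two-digit year; Python maps 00-68 to 2000-2068.
    return datetime.strptime(version, "%y%m%d").date()


print(parse_model_version("130202"))  # 2013-02-02
```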

1.1. Download

The latest version 130202 of the Czech Korektor models can be downloaded from the LINDAT/CLARIN repository.

1.2. Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

The latest Czech models were created by Michal Richter as part of his Master's thesis and are described in (Richter et al. 2012).

The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.

1.2.1. Publications

(Richter et al. 2012) Richter Michal, Straňák Pavel and Rosen Alexandr. Korektor – A System for Contextual Spell-checking and Diacritics Completion. In Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012), pages 1-12, Mumbai, India, 2012.

1.3. Czech Model Variants

The Czech model contains the following variants:

korektor-czech-130202/diacritics_h2mor.conf: Spellchecker model which only _adds_ diacritical marks. Note that the model does not remove diacritical marks already present in the input, so you have to strip them manually if you want them to be ignored.
korektor-czech-130202/spellchecking_h2mor.conf: Spellchecker model which considers corrections with edit distance at most one. You should use this model for generic spellchecking.
korektor-czech-130202/spellchecking_h2mor_2edits.conf: Spellchecker model which considers corrections with edit distance at most two. This model can be useful if the required corrections are not found by the spellchecking_h2mor.conf model, but it may be considerably slower.
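Because the diacritics model only adds marks, existing ones have to be stripped before the text is passed to it if they should be ignored. One common way to do this is Unicode decomposition; the sketch below assumes that approach, and the helper name `strip_diacritics` is hypothetical and not part of Korektor.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (carons, acute accents, rings) from text."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in a canonical form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Hoši jeli k babičce."))  # Hosi jeli k babicce.
```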

2. Running the Korektor

The korektor binary is used to run the Korektor. The only required argument is the model configuration to use for correcting. The input is read from standard input and should be in UTF-8 encoding; it can be already tokenized and segmented, segmented only, or plain text. The output is written to standard output in UTF-8 encoding.

The full command syntax of korektor is

korektor [options] model_configuration
Options: --input=untokenized|untokenized_lines|segmented|vertical|horizontal
         --output=original|xml|vertical|horizontal
         --corrections=maximum_number_of_corrections
         --viterbi_order=viterbi_decoding_order
         --viterbi_beam_size=maximum_viterbi_beam_size
         --viterbi_stage_pruning=maximum_viterbi_stage_cost_increment
         --context_free
         --version
         --help
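When korektor is driven from a script, the command line can be assembled from the options listed above. The following is a minimal sketch under stated assumptions: the helper names `build_korektor_command` and `run_korektor` are hypothetical, the binary is assumed to be on the PATH as `korektor`, and the model path is a placeholder. Only options from the syntax summary are used.

```python
import subprocess

INPUT_FORMATS = ("untokenized", "untokenized_lines", "segmented",
                 "vertical", "horizontal")
OUTPUT_FORMATS = ("original", "xml", "vertical", "horizontal")


def build_korektor_command(model_configuration, input_format=None,
                           output_format=None, corrections=None,
                           context_free=False, binary="korektor"):
    """Assemble the argument list for one korektor invocation."""
    if input_format is not None and input_format not in INPUT_FORMATS:
        raise ValueError(f"unknown input format: {input_format!r}")
    if output_format is not None and output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")
    command = [binary]
    if input_format is not None:
        command.append(f"--input={input_format}")
    if output_format is not None:
        command.append(f"--output={output_format}")
    if corrections is not None:
        command.append(f"--corrections={int(corrections)}")
    if context_free:
        command.append("--context_free")
    command.append(model_configuration)  # the only required argument
    return command


def run_korektor(text, model_configuration, **options):
    """Send UTF-8 text to korektor on stdin and return its UTF-8 stdout."""
    command = build_korektor_command(model_configuration, **options)
    completed = subprocess.run(command, input=text.encode("utf-8"),
                               stdout=subprocess.PIPE, check=True)
    return completed.stdout.decode("utf-8")


# Placeholder model path; only the command is printed, nothing is executed.
print(build_korektor_command("path/to/spellchecking_h2mor.conf",
                             input_format="segmented", corrections=3))
```

Passing the arguments as a list avoids shell quoting issues with model paths that contain spaces.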

2.1. Input Formats

The input format is specified using the --input option. Currently supported input formats are:

untokenized (default): the input is plain text, which is segmented and tokenized automatically. Note that sentences can span multiple lines, but an empty line always terminates a sentence.
untokenized_lines: very similar to untokenized, the only difference is that sentences cannot span over multiple lines, so every newline is a sentence terminator.
segmented: the input is assumed to be segmented using newlines, but it is tokenized automatically.
vertical: the input is tokenized and segmented in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.
horizontal: the input is tokenized and segmented in horizontal format, every line is a sentence, with words separated by spaces.
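For text that has already been tokenized and segmented by another tool, the vertical and horizontal layouts described above can be produced as follows. This is a sketch only: the helper names are hypothetical, sentences are assumed to be lists of token strings, and tokens are assumed to contain no spaces or newlines.

```python
def to_vertical(sentences):
    """One token per line, with an empty line after every sentence."""
    return "".join(
        "".join(f"{token}\n" for token in sentence) + "\n"
        for sentence in sentences
    )


def to_horizontal(sentences):
    """One sentence per line, tokens separated by single spaces."""
    return "".join(" ".join(sentence) + "\n" for sentence in sentences)


sentences = [["Hoši", "jely", "k", "babicce", "."], ["Ahoj", "."]]
print(to_vertical(sentences), end="")
print(to_horizontal(sentences), end="")
```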

2.2. Number of Corrections

The maximum number of corrections that Korektor should return for every word is specified using the --corrections option, and defaults to one.

Note that some output formats cannot handle multiple corrections, because they can only replace the original word by a corrected one.
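A calling script can check the combination of correction count and output format before invoking the tool. The sketch below encodes only what this manual states: original and horizontal replace each word with a single correction, and the default output format is original for one correction and xml for more. The helper names are hypothetical, and how korektor itself reacts to an unsupported combination is not specified here.

```python
# Output formats that replace each word with exactly one correction.
SINGLE_CORRECTION_FORMATS = ("original", "horizontal")


def default_output_format(corrections=1):
    """Default output format documented for a given number of corrections."""
    return "original" if corrections == 1 else "xml"


def check_output_format(output_format, corrections=1):
    """Raise ValueError if the format cannot show multiple corrections."""
    if corrections > 1 and output_format in SINGLE_CORRECTION_FORMATS:
        raise ValueError(
            f"{output_format!r} cannot represent {corrections} corrections "
            "per word; use 'xml' or 'vertical' instead"
        )


print(default_output_format(1))  # original
print(default_output_format(3))  # xml
```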

2.3. Output Formats

The output format is specified using the --output option. Currently supported output formats are:

original (default when number of corrections is 1): the original words are replaced by the corrected ones, all other characters including spaces are preserved. Note that this output format cannot handle multiple corrections per word.
xml (default when number of corrections is greater than 1): the original input is encoded as XML and the suggested corrections are marked using the following XML elements:
- spelling corrections for a word w are marked using the spelling element with the suggested corrections listed in the suggestions attribute ordered by correction probability with the most probable one first
- grammar corrections are marked as spelling corrections, but the grammar element is used instead of spelling
To illustrate, consider the input
```
  Hoši jely k babicce.
```
The output in xml output format with at most three corrections is
```
  Hoši <grammar suggestions="jeli jely jel">jely</grammar> k <spelling suggestions="babičce babice babince">babicce</spelling>.
```
vertical: each word is printed on a separate line, with an empty line denoting the end of a sentence. If there are any suggested corrections for a word, they are printed on the same line as the original word using several tab-separated columns:
- the first column contains the original word
- the second column contains either the letter S or G, where S denotes a spelling correction and G denotes a grammar correction
- the rest of the columns are the suggested corrections ordered by correction probability with the most probable one first
To illustrate, consider the input
```
  Hoši jely k babicce.
```
The output in vertical output format with at most three corrections, with tab characters marked explicitly, is
```
  Hoši
  jely<---tab---->G<--tab-->jeli<---tab---->jely<--tab--->jel
  k
  babicce<--tab-->S<--tab-->babičce<--tab-->babice<--tab-->babince
  .
```
horizontal: the original words are replaced by the corrected ones. Each sentence is printed on a separate line and all words are space separated. Note that this output format cannot handle multiple corrections per word.
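The vertical and xml outputs are the two formats that carry multiple suggestions, so they are the ones a downstream script usually needs to parse. The following is a sketch under stated assumptions: the function names are hypothetical, the vertical output is assumed to use real tab characters as column separators, and the xml output is assumed to be a fragment without a root element (as in the example above), so it is wrapped in a temporary root before parsing.

```python
import xml.etree.ElementTree as ET


def parse_vertical(output):
    """Return sentences as lists of (word, kind, suggestions) tuples.

    kind is "S", "G", or None when the word has no suggested corrections.
    """
    sentences, current = [], []
    for line in output.split("\n"):
        if not line:  # an empty line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        kind = columns[1] if len(columns) > 1 else None
        current.append((columns[0], kind, columns[2:]))
    if current:
        sentences.append(current)
    return sentences


def parse_xml_corrections(fragment):
    """Return (element name, original word, suggestions) for each correction."""
    # Assumption: the fragment has no root element of its own.
    root = ET.fromstring(f"<root>{fragment}</root>")
    return [
        (element.tag, element.text, element.get("suggestions", "").split())
        for element in root
        if element.tag in ("spelling", "grammar")
    ]


vertical = ("Hoši\njely\tG\tjeli\tjely\tjel\nk\n"
            "babicce\tS\tbabičce\tbabice\tbabince\n.\n\n")
print(parse_vertical(vertical))

xml_output = ('Hoši <grammar suggestions="jeli jely jel">jely</grammar> k '
              '<spelling suggestions="babičce babice babince">babicce'
              '</spelling>.')
print(parse_xml_corrections(xml_output))
```

In both cases the suggestions keep the documented order, with the most probable correction first.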

2.4. Context Free Corrections

Context free corrections can be generated by supplying the --context_free option. In that case each word is considered separately and sentence boundaries are ignored. This mode produces much worse results and should be used only when no context is available.

2.5. Viterbi Decoding Options

The decoding Viterbi algorithm can be tweaked using the following options:

--viterbi_order: Use a specific Viterbi decoding order instead of the default one. Use 1 for the fastest execution, but the worst accuracy. Setting this to a value higher than the maximum model order minus one has no effect.
--viterbi_beam_size: Limit the Viterbi beam size to the specified constant. Use a smaller value for faster execution, but worse accuracy.
--viterbi_stage_pruning: Limit the maximum cost increment in one Viterbi stage. Use a smaller value for faster execution, but worse accuracy.
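To give an intuition for why smaller beam and pruning values trade accuracy for speed, here is a generic sketch of order-1 Viterbi decoding with both kinds of pruning. It is not Korektor's implementation: the function name, the cost functions and the toy costs are hypothetical, and stage pruning is interpreted here (as an assumption) as dropping hypotheses whose cost exceeds the best cost in the stage by more than the threshold.

```python
def beam_viterbi(candidates, emission_cost, transition_cost,
                 beam_size=None, stage_pruning=None):
    """Return (cost, path) of the cheapest candidate sequence found.

    candidates[i] is a non-empty list of candidate words for position i.
    """
    # Maps the last chosen word to the cheapest (cost, path) ending in it.
    beam = {None: (0.0, [])}
    for position, options in enumerate(candidates):
        stage = {}
        for word in options:
            best = None
            for previous, (cost, path) in beam.items():
                total = (cost + transition_cost(previous, word)
                         + emission_cost(position, word))
                if best is None or total < best[0]:
                    best = (total, path + [word])
            stage[word] = best
        ranked = sorted(stage.items(), key=lambda item: item[1][0])
        if stage_pruning is not None:
            # Drop hypotheses too far above the best one in this stage.
            limit = ranked[0][1][0] + stage_pruning
            ranked = [item for item in ranked if item[1][0] <= limit]
        if beam_size is not None:
            # Keep only the cheapest beam_size hypotheses.
            ranked = ranked[:beam_size]
        beam = dict(ranked)
    return min(beam.values(), key=lambda hypothesis: hypothesis[0])


# Toy costs, invented purely for illustration.
original = ["Hoši", "jely", "k", "babicce"]
candidates = [["Hoši"], ["jely", "jeli"], ["k"], ["babicce", "babičce"]]
pair_costs = {("Hoši", "jeli"): 0.0, ("Hoši", "jely"): 3.0,
              ("k", "babičce"): 0.0, ("k", "babicce"): 5.0}


def emission_cost(position, word):
    # Changing the original word costs 1, keeping it costs 0.
    return 0.0 if word == original[position] else 1.0


def transition_cost(previous, word):
    return pair_costs.get((previous, word), 1.0)


print(beam_viterbi(candidates, emission_cost, transition_cost))
print(beam_viterbi(candidates, emission_cost, transition_cost, beam_size=1))
```

With fewer hypotheses kept per stage there is less work in the inner loop, but a hypothesis that looks expensive early and would have turned out cheapest overall may be discarded.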

3. Running the tokenizer

The tokenizer binary is used to run the tokenizer. The input is read from standard input and should be in UTF-8 encoding; it can be already tokenized and segmented, segmented only, or plain text. The output is written to standard output in UTF-8 encoding.

The full command syntax of tokenizer is

tokenizer [options]
Options: --input=untokenized|untokenized_lines|segmented|vertical|horizontal
         --output=vertical|horizontal
         --version
         --help

3.1. Input Formats

The input format is specified using the --input option. Currently supported input formats are:

untokenized (default): the input is plain text, which is segmented and tokenized automatically. Note that sentences can span multiple lines, but an empty line always terminates a sentence.
untokenized_lines: very similar to untokenized, the only difference is that sentences cannot span over multiple lines, so every newline is a sentence terminator.
segmented: the input is assumed to be segmented using newlines, but it is tokenized automatically.
vertical: the input is tokenized and segmented in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.
horizontal: the input is tokenized and segmented in horizontal format, every line is a sentence, with words separated by spaces.

3.2. Output Formats