UDPipe 1 User's Manual
This section describes the command line tools. The C++ API is described on the API Reference page.
1. Running UDPipe
Probably the most common usage of UDPipe is to tokenize, tag and parse the input using
udpipe --tokenize --tag --parse udpipe_model
The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or it can be a plain text which will be tokenized and segmented automatically.
Any number of input files can be specified after the udpipe_model; if no file is given, the standard input is used. The output is written to the standard output by default, but if --outfile=name is used, it is saved to the given file name. The output file name can contain {}, which is replaced by the base name of the processed file (i.e., without directories and without the extension).
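For example, the following sketch (the model and file names are only placeholders) processes every *.txt file and writes each result next to its input, so that report.txt produces report.conllu:
udpipe --tokenize --tag --parse --outfile='{}.conllu' udpipe_model *.txt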
The full command syntax of running UDPipe is
Usage: udpipe [running_opts] udpipe_model [input_files]
       udpipe --train [training_opts] udpipe_model [input_files]
       udpipe --detokenize [detokenize_opts] raw_text_file [input_files]
Running opts: --accuracy (measure accuracy only)
              --input=[conllu|generic_tokenizer|horizontal|vertical]
              --immediate (process sentences immediately during loading)
              --outfile=output file template
              --output=[conllu|matxin|horizontal|plaintext|vertical]
              --tokenize (perform tokenization)
              --tokenizer=tokenizer options, implies --tokenize
              --tag (perform tagging)
              --tagger=tagger options, implies --tag
              --parse (perform parsing)
              --parser=parser options, implies --parse
Training opts: --method=[morphodita_parsito] which method to use
               --heldout=heldout data file name
               --tokenizer=tokenizer options
               --tagger=tagger options
               --parser=parser options
Detokenize opts: --outfile=output file template
Generic opts: --version
              --help
1.1. Immediate Mode
By default, UDPipe loads the whole input file into memory before processing it. This allows the space markup (see the following Tokenizer section) to be stored in the most consistent way, i.e., all spaces following a sentence are stored in the last token of that sentence.
However, sometimes it is desirable to process the input as soon as possible, which can be achieved by specifying the --immediate option. In immediate mode, the input is processed and printed as soon as a block of input guaranteed to contain whole sentences is loaded. Specifically, for most input formats the input is processed after loading an empty line (the exceptions are the horizontal input format and the presegmented tokenizer, where the input is processed after each line).
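A sketch of a streaming invocation (the model and file names are placeholders): sentences are tagged and parsed as soon as they are read from the pipe, instead of after the whole input has been consumed.
tail -f transcript.txt | udpipe --immediate --tokenize --tag --parse udpipe_model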
1.2. Loading Model On Demand
Although a model always has to be specified, it is loaded only if it is really needed. It is therefore possible to pass for example none as the model when no model is required for the requested operation (e.g., when converting between formats or using the generic tokenizer).
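For instance, converting a CoNLL-U file to the horizontal format needs no trained model, so none can be used (the input file name is illustrative):
udpipe --input=conllu --output=horizontal none input.conllu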
1.3. Tokenizer
If the --tokenize option is supplied, the input is assumed to be plain text and is tokenized using the model tokenizer. Additional arguments to the tokenizer can be specified using the --tokenizer=data option (which implies --tokenize), where data is a semicolon-separated list of the following options (an example invocation follows the list):
- normalized_spaces: by default, UDPipe uses custom MISC fields to exactly encode the spaces of the original document (as described below). If the normalized_spaces option is given, only the standard CoNLL-U v2 markup (SpaceAfter=No and # newpar) is used.
- presegmented: the input file is assumed to be already segmented, with each sentence on a separate line, and is only tokenized (respecting the sentence breaks).
- ranges: for each token, its range in the original document is stored in the format described below.
- joint_with_parsing: an experimental mode performing sentence segmentation jointly using the tokenizer and the parser (see the Milan Straka and Jana Straková: Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe paper for details). The following options are utilized:
  - joint_max_sentence_len (default 20): maximum sentence length
  - joint_change_boundary_logprob (default -0.5): logprob of using a sentence boundary not generated by the tokenizer
  - joint_sentence_logprob (default -0.5): additional logprob of every sentence
  The logprob of every sentence is the score assigned by the parser, plus joint_sentence_logprob, and also joint_change_boundary_logprob for every sentence boundary not returned by the tokenizer (i.e., either 0, 1 or 2 times). The joint sentence segmentation chooses such a segmentation where every sentence has length at most joint_max_sentence_len and the sum of logprobs of all sentences is as large as possible.
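As an illustration (the model and file names are placeholders), the following command tokenizes already segmented plain text and records token ranges; the quotes keep the shell from splitting the value on the semicolon:
udpipe --tokenizer="presegmented;ranges" --tag --parse udpipe_model input.txt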
1.3.1. Preserving Original Spaces
By default, UDPipe uses custom MISC fields to store all spaces of the original document. This markup is backward compatible with the CoNLL-U v2 SpaceAfter=No feature.
This markup can be utilized by the plaintext output format, which allows reconstructing the original document.
Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in an XML file).
The markup uses the following MISC fields on tokens (not words in multi-word tokens):
- SpacesBefore=content (by default empty): spaces/other content preceding the token
- SpacesAfter=content (by default a space if the SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
- SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including the original spaces (needed only if tokens are allowed to contain spaces and a token contains tab or newline characters)
The content of all three fields must be escaped to allow storing tabs and newlines.
The following C-like schema is used:
- \s: space
- \t: tab
- \r: CR character
- \n: LF character
- \p: | (pipe character)
- \\: \ (backslash character)
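For illustration only (a made-up token line with tab-separated columns), a token that was followed in the original document by a newline and a tab would carry this MISC annotation:
1	Hello	hello	INTJ	_	_	0	root	_	SpacesAfter=\n\t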
1.3.2. Preserving Token Ranges
When the ranges tokenizer option is used, the range of each token in the original document is stored in the TokenRange MISC field.
The format of the TokenRange field (inspired by Python) is TokenRange=start:end, where start is a zero-based document-level index of the start of the token (counted in Unicode characters) and end is a zero-based document-level index of the first character following the token (i.e., the length of the token is end-start).
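As a hypothetical illustration, tokenizing the document "Hello world." with the ranges option would annotate the tokens roughly as follows:
Hello  TokenRange=0:5
world  TokenRange=6:11
.      TokenRange=11:12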
1.4. Input Formats
If the tokenizer is not used, the input format can be specified using the --input option. The individual input formats can be parametrized in the same way as a tokenizer, by using the format=data syntax. The currently supported input formats are:
- conllu (default): the CoNLL-U format. Supported options:
  - v2 (default): use CoNLL-U v2
  - v1: allow loading only CoNLL-U v1 (i.e., no empty nodes and no spaces in forms and lemmas)
- generic_tokenizer: a generic tokenizer for English-like languages (with spaces separating tokens and English-like punctuation). The tokenizer is rule-based and needs no trained model. It supports the same options as a model tokenizer, i.e., normalized_spaces, presegmented and ranges.
- horizontal: each sentence on a separate line, with tokens separated by spaces. In order to allow spaces in tokens, the Unicode character 'NO-BREAK SPACE' (U+00A0) is considered part of a token and converted to a space during loading.
- vertical: each token on a separate line, with an empty line denoting the end of a sentence; only the first tab-separated word is used as a token, the rest of the line is ignored.
Note that a model tokenizer can also be specified using the --input option, by using the tokenizer input format, for example --input tokenizer=ranges.
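As another example (the model and file names are placeholders), already tokenized text with one sentence per line can be tagged and parsed directly:
udpipe --input=horizontal --tag --parse udpipe_model sentences.txt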
1.5. Tagger
If the --tag option is supplied, the input is POS tagged and lemmatized using the model tagger. Additional arguments to the tagger might be specified using the --tagger=data option (which implies --tag).
1.6. Dependency Parsing
If the --parse option is supplied, the input is parsed using the model dependency parser. Additional arguments to the parser might be specified using the --parser=data option (which implies --parse).
1.7. Output Formats
The output format is specified using the --output option. The individual output formats can be parametrized in the same way as input formats, by using the format=data syntax. The currently supported output formats are:
- conllu (default): the CoNLL-U format. Supported options:
  - v2 (default): use CoNLL-U v2
  - v1: produce output in CoNLL-U v1 format. Note that this is a lossy process, as empty nodes are ignored and spaces in forms and lemmas are converted to underscores.
- matxin: the Matxin format
- horizontal: writes the words (in the UD sense) in horizontal format, that is, each sentence is on a separate line, with words separated by a single space. Because words can contain spaces in CoNLL-U v2, the spaces in words are converted to the Unicode character 'NO-BREAK SPACE' (U+00A0). Supported options:
  - paragraphs: an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
- plaintext: writes the tokens (in the UD sense) using the original spacing. By default, UDPipe's custom MISC features (SpacesBefore, SpacesAfter and SpacesInToken, see the description in the Tokenizer section) are used to reconstruct the exact original spaces. However, if the document does not contain these features or if you want only normalized spacing, you can use the following option:
  - normalized_spaces: write one sentence on a line, with either one or no space between tokens according to the SpaceAfter=No feature
- vertical: each word on a separate line, with an empty line denoting the end of a sentence. Supported options:
  - paragraphs: an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
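For example (the file name is a placeholder), the original plain text can be reconstructed from a CoNLL-U file produced by the tokenizer, again using none as the model:
udpipe --input=conllu --output=plaintext none tokenized.conllu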
2. Running the UDPipe REST Server
UDPipe also provides a REST server binary called udpipe_server. The binary uses MicroRestD as a REST server implementation and provides the UDPipe REST API.
The full command syntax of udpipe_server is
udpipe_server [options] port default_model (model_ids model_file acknowledgements)*
Options: --connection_timeout=maximum connection timeout [s] (default 60)
         --daemon (daemonize after start, supported on Linux only)
         --log_file=file path (no logging if empty, default udpipe_server.log)
         --log_request_max_size=max req log size [kB] (0 unlimited, default 64)
         --max_connections=maximum network connections (default 256)
         --max_request_size=maximum request size [kB] (default 1024)
         --threads=threads to use (default 0 means unlimited)
The udpipe_server can run either in the foreground or in the background (when --daemon is used).
For every model, you specify three arguments:
- model_ids: possibly multiple model ids, separated by a semicolon. Note that apart from splitting on the semicolon, every hyphen-separated prefix of any id is also considered an id
- model_file: the file name containing the model
- acknowledgements: if non-empty, it is added to the list of acknowledgements returned when using this model
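A hypothetical invocation serving one model on port 8001 might look as follows (the model id and file name are placeholders; the final empty string stands for empty acknowledgements):
udpipe_server --daemon --threads=4 8001 english english english-ud.udpipe ''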
Since UDPipe 1.1.1, the models are loaded on demand, so that at most
concurrent_models
(default 10) are kept in memory at the same time.
The model files are opened during start and never closed until the server
stops. Unless no_check_models_loadable
is specified, the model files
are also checked to be loadable during start. Note that the default model
is preloaded and never released, unless no_preload_default
is given.
(Before UDPipe 1.1.1, specified model files were loaded during start and
kept in memory all the time.)
3. Training UDPipe Models
Custom UDPipe models can be trained using the following syntax:
udpipe --train model.output [--heldout=heldout_data] training_file ...
The training data should be in the CoNLL-U format.
By default, three model components are trained – tokenizer, tagger and parser. Any subset of the model components can be trained and a model component may be copied from an existing model.
The training options are specified for each model component separately using the --tokenizer, --tagger and --parser options. If a model component should not be trained, the value none should be used (e.g., --tagger=none).
The options are name=value pairs separated by semicolons. A value can be either a simple string value (terminated by a semicolon), file content specified as name=file:filename, or an arbitrary string value specified as name=data:length:value, where the value is exactly length bytes long.
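A brief sketch (file names are placeholders): train only the tagger with default settings, skipping the tokenizer and the parser, and supply a heldout file:
udpipe --train my.model --heldout=dev.conllu --tokenizer=none --parser=none train.conllu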
3.1. Reusing Components from Existing Models
The model components (tokenizer, tagger or parser) can be reused from existing models by specifying the from_model=file:filename option.
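For example (file names are placeholders), the following reuses the tokenizer of an existing model while training a new tagger and parser from the supplied data:
udpipe --train new.model --tokenizer=from_model=file:old.model train.conllu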
3.2. Random Hyperparameter Search
The default values of the hyperparameters are set to the values which were used most frequently when training the UD 1.2 models, but to reach the best performance, the hyperparameters must be tuned.
Apart from a manual grid search, UDPipe can perform a simple random search. You can perform the random search by repeatedly training UDPipe (preferably in parallel, most likely on different computers), each time specifying a different training run number – some of the hyperparameters (chosen by us; you can of course override their values by specifying them on the command line) change their values in different training runs. The pseudorandom sequences of hyperparameters are of course deterministic.
The training run can be specified by providing the run=number option to a model component. Run number 1 is the default one (with the best hyperparameters for the UD 1.2 models); run numbers 2 and higher randomize the hyperparameters.
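As a sketch (file names are placeholders), a second randomized run of tagger and parser training could be started with:
udpipe --train model-run2 --tagger=run=2 --parser=run=2 train.conllu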
3.3. Tokenizer
The tokenizer is trained using the SpaceAfter=No features in the CoNLL-U files. If the feature is not present, a detokenizer can be used to guess the SpaceAfter=No features according to a supplied plain text (which typically does not overlap with the texts in the CoNLL-U files).
In order to use the detokenizer, use the detokenizer=file:filename_with_plaintext option. For the UD 1.2 models, the optimal performance is achieved with very small plain texts – only 500 kB.
The tokenizer recognizes the following options:
- tokenize_url (default 1): tokenize URLs and emails using a manually implemented recognizer
- allow_spaces (default 1 if any token contains a space, 0 otherwise): allow tokens to contain spaces
- dimension (default 24): dimension of the character embeddings and of the per-character bidirectional GRU. Note that inference time is quadratic in this parameter. The supported values are only 16, 24 and 64, with 64 needed for languages with complicated tokenization or segmentation, like Japanese, Chinese or Vietnamese.
- segment_size (default 50): length of the character segment used to predict token and sentence breaks. Larger values like 200 are needed for languages with complicated segmentation.
- epochs (default 100): the number of epochs to train the tokenizer for
- batch_size (default 50): batch size (number of segments) used during tokenizer training
- learning_rate (default 0.005): the learning rate used during tokenizer training
- learning_rate_final (default 0.0): if not zero, use exponential learning rate decay so that the last epoch uses this learning rate
- dropout (default 0.1): dropout used during tokenizer training
- early_stopping (default 1 if heldout data is given, 0 otherwise): perform early stopping, choosing the training iteration maximizing the sentence F1 score plus the token F1 score on the heldout data
During the random hyperparameter search, batch_size is chosen uniformly from {50,100} and learning_rate logarithmically from <0.0005, 0.01).
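A sketch of a tokenizer-only training run (file names are placeholders), using a detokenizer plain text because the training data lack SpaceAfter=No annotations:
udpipe --train tokenizer.model --tokenizer="dimension=64;detokenizer=file:plain.txt" --tagger=none --parser=none train.conllu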
3.3.1. Detokenizing CoNLL-U Files
The detokenizer option described above allows generating the SpaceAfter=No features automatically from a given plain text. Even though the current algorithm is very simple and makes quite a lot of mistakes, a tokenizer trained on the generated features is very close to a tokenizer trained on gold SpaceAfter=No features (the difference in token F1 score is usually one or two tenths of a percent).
The generated SpaceAfter=No features are only used during tokenizer training and are not printed. However, if you would like to obtain CoNLL-U files with automatic detokenization (generated SpaceAfter=No features), you can run UDPipe with the --detokenize option. In this case, you have to supply plain text in the given language (usually the best results are achieved with just 500 kB or 1 MB of text) and UDPipe then detokenizes all the given CoNLL-U files.
The complete usage of the --detokenize option is:
udpipe --detokenize [detokenize_opts] raw_text_file [input_files]
Detokenize opts: --outfile=output file template
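For example (file names are placeholders), the following detokenizes two CoNLL-U files using a Czech plain-text sample and writes the results to new files:
udpipe --detokenize --outfile='{}-detok.conllu' czech_plain.txt file1.conllu file2.conllu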
3.4. Tagger
The tagging is currently performed using MorphoDiTa. The UDPipe tagger consists of possibly several MorphoDiTa models, each tagging some of the POS tags and/or lemmas.
By default, only one model is constructed, which generates all available tags (UPOS, XPOS, Feats and Lemma). However, we found during the UD 1.2 models training that performance improves if one model tags the UPOS, XPOS and Feats fields while the other performs lemmatization. Therefore, if you utilize two MorphoDiTa models, by default the first one generates all tags (except lemmas) and the second one performs lemmatization.
The number of MorphoDiTa models can be specified using the models=number parameter. All other parameters may be either generic for all models (guesser_suffix_rules=5), or specific for a given model (guesser_suffix_rules_2=6), including the from_model option (therefore, MorphoDiTa models can be trained separately and then combined together into one UDPipe model).
Every model utilizes UPOS for disambiguation and the first model is the one producing the UPOS tags on output.
The tagger recognizes the following options:
- use_lemma (default for the second model and also if there is only one model): use the lemma field internally to perform disambiguation; the lemma need not be output
- provide_lemma (default for the second model and also if there is only one model): produce the disambiguated lemma on output
- use_xpostag (default for the first model): use the XPOS tags internally to perform disambiguation; they need not be output
- provide_xpostag (default for the first model): produce the disambiguated XPOS tag on output
- use_feats (default for the first model): use the Feats internally to perform disambiguation; they need not be output
- provide_feats (default for the first model): produce the disambiguated Feats field on output
- dictionary_max_form_analyses (default 0 - unlimited): the maximum number of (most frequent) form analyses from the UD training data that are to be kept in the morphological dictionary
- dictionary_file (default empty): use a given custom morphological dictionary, where each line contains 5 tab-separated fields FORM, LEMMA, UPOSTAG, XPOSTAG and FEATS. Note that this dictionary data is appended to the dictionary created from the UD training data, not replacing it.
- guesser_suffix_rules (default 8): number of rules generated for every suffix
- guesser_prefixes_max (default 4 if provide_lemma is set, 0 otherwise): maximum number of form-generating prefixes to use in the guesser
- guesser_prefix_min_count (default 10): minimum number of occurrences of a form-generating prefix to consider using it in the guesser
- guesser_enrich_dictionary (default 6 if no dictionary_file is passed, 0 otherwise): number of rules generated for forms present in the training data (assuming that the analyses from the training data may not be complete)
- iterations (default 20): number of training iterations to perform
- early_stopping (default 1 if heldout data is given, 0 otherwise): perform early stopping, choosing the training iteration maximizing the tagging accuracy on the heldout data
- templates (default lemmatizer for the second model, tagger otherwise): MorphoDiTa feature templates to use, either lemmatizer, which focuses more on lemmas, or tagger, which focuses more on UPOS/XPOS/FEATS
During the random hyperparameter search, guesser_suffix_rules is chosen uniformly from {5,6,7,8,9,10,11,12} and guesser_enrich_dictionary is chosen uniformly from {3,4,5,6,7,8,9,10}.
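A sketch of tagger-only training using two MorphoDiTa models (file names are placeholders), with a generic option and one model-specific override:
udpipe --train tagger.model --tokenizer=none --parser=none --tagger="models=2;guesser_suffix_rules=8;guesser_suffix_rules_2=6" train.conllu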
3.5. Parser
The parsing is performed using Parsito, which is a transition-based parser using a neural-network classifier.
The transition-based systems can be configured by the following options:
- transition_system (default projective): which transition system to use for parsing (language dependent, you can choose according to language properties or try all and choose the best one)
  - projective: projective stack-based arc-standard system with shift, left_arc and right_arc transitions
  - swap: fully non-projective system which extends the projective system by adding the swap transition
  - link2: partially non-projective system which extends the projective system by adding the left_arc2 and right_arc2 transitions
- transition_oracle (default dynamic/static_lazy/static, whichever is applicable first): which transition oracle to use for the chosen transition_system:
  - transition_system=projective: available oracles are static and dynamic (dynamic usually gives better results, but training time is slower)
  - transition_system=swap: available oracles are static_eager and static_lazy (static_lazy almost always gives better results)
  - transition_system=link2: the only available oracle is static
- structured_interval (default 8): use a search-based oracle in addition to the specified transition_oracle. This almost always gives better results, but makes training 2-3 times slower. For details, see the paper Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle.
- single_root (default 1): allow only a single root when parsing, and make sure only the root node has the root deprel (note that the training data are checked to be in this format)
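For example (file names are placeholders), a fully non-projective parser could be trained with:
udpipe --train parser.model --tokenizer=none --tagger=none --parser="transition_system=swap;transition_oracle=static_lazy" train.conllu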
The Lemmas/UPOS/XPOS/FEATS used by the parser are configured by:
- use_gold_tags (default 0): if false and a tagger exists, the Lemmas/UPOS/XPOS/FEATS for both the training and heldout data are generated by the tagger, otherwise they are taken from the gold data
The embeddings used by the parser can be specified as follows:
- embedding_upostag (default 20): the dimension of the UPOS embedding used in the parser
- embedding_feats (default 20): the dimension of the Feats embedding used in the parser
- embedding_xpostag (default 0): the dimension of the XPOS embedding used in the parser
- embedding_form (default 50): the dimension of the Form embedding used in the parser
- embedding_lemma (default 0): the dimension of the Lemma embedding used in the parser
- embedding_deprel (default 20): the dimension of the Deprel embedding used in the parser
- embedding_form_file: pre-trained word embeddings in the word2vec textual format
- embedding_lemma_file: pre-trained lemma embeddings in the word2vec textual format
- embedding_form_mincount (default 2): for forms not present in the pre-trained embeddings, generate random embeddings if the form appears at least this number of times in the training data (forms not present in the pre-trained embeddings and appearing fewer times are considered OOV)
- embedding_lemma_mincount (default 2): for lemmas not present in the pre-trained embeddings, generate random embeddings if the lemma appears at least this number of times in the training data (lemmas not present in the pre-trained embeddings and appearing fewer times are considered OOV)
The neural-network training options:
- iterations (default 10): number of training iterations to use
- hidden_layer (default 200): the size of the hidden layer
- batch_size (default 10): batch size used during neural-network training
- learning_rate (default 0.02): the learning rate used during neural-network training
- learning_rate_final (default 0.001): the final learning rate used during neural-network training
- l2 (default 0.5): the L2 regularization used during neural-network training
- early_stopping (default 1 if heldout data is given, 0 otherwise): perform early stopping, choosing the training iteration maximizing the LAS on the heldout data
During the random hyperparameter search, structured_interval is chosen uniformly from {0,8,10}, learning_rate is chosen logarithmically from <0.005,0.04) and l2 is chosen uniformly from <0.2,0.6).
3.5.1. Pre-trained Word Embeddings
The pre-trained word embeddings for forms and lemmas can be specified in the word2vec textual format using the embedding_form_file and embedding_lemma_file options.
Note that pre-training word embeddings even on the UD data itself improves the accuracy (we use word2vec with the -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 -threads 12 -binary 0 -iter 15 -min-count 2 options to pre-train on the UD data after converting it to the horizontal format using udpipe --output=horizontal).
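A sketch of this workflow (file names are placeholders; the word2vec binary is assumed to be available on the path): convert the training data to the horizontal format, pre-train form embeddings, and pass them to parser training:
udpipe --output=horizontal none train.conllu > forms.txt
word2vec -train forms.txt -output forms.vec -cbow 0 -size 50 -window 10 -negative 5 -hs 0 -sample 1e-1 -threads 12 -binary 0 -iter 15 -min-count 2
udpipe --train parser.model --tokenizer=none --tagger=none --parser=embedding_form_file=forms.vec train.conllu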
Forms and lemmas can contain spaces in CoNLL-U v2, so these spaces are
converted to a Unicode character 'NO-BREAK SPACE' (U+00A0) before performing the
embedding lookup, because spaces are usually used to delimit tokens in word
embedding generating software (both word2vec
and glove
use spaces to
separate words on input and on output). When using UDPipe to generate
plain texts from CoNLL-U format using --output=horizontal
, this space
replacing happens automatically.
When looking up an embedding for a given word, the following possibilities are tried in the following order until a match is found (or the embedding for an unknown word is returned):
- original word
- all but the first character lowercased
- all characters lowercased
- if the word contains only digits, just the first digit is tried
3.6. Measuring Model Accuracy
The accuracy of a custom model can be measured by running:
udpipe --accuracy [udpipe_options] udpipe_model file ...
The command syntax is similar to the regular UDPipe operation, except that the input must always be in the CoNLL-U format and the --input and --output options are ignored.
Three different settings (depending on --tokenize(r), --tag(ger) and --parse(r)) can be evaluated:
- --tokenize(r) [--tag(ger) [--parse(r)]]: The tokenizer is used to segment and tokenize plain text (obtained from the SpaceAfter=No features and the # newdoc and # newpar comments in the input file). Optionally, a tagger is used on the resulting data to obtain the Lemma/UPOS/XPOS/Feats columns, and eventually a parser can be used to parse the results. The tokenizer is evaluated using F1-score on tokens, multi-word tokens, sentences and words. The words are aligned using the word-alignment algorithm described in the CoNLL 2017 Shared Task in UD Parsing. The tagger and parser are evaluated on aligned words, resulting in F1 scores of Lemmas/UPOS/XPOS/Feats/UAS/LAS.
- --tag(ger) [--parse(r)]: The gold segmented and tokenized input is tagged (and then optionally parsed using the tagger outputs) and then evaluated.
- --parse(r): The gold segmented and tokenized input is parsed using the gold morphology (Lemmas/UPOS/XPOS/Feats) and evaluated.
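For example (the model and file names are placeholders), the tagger and parser of a model can be evaluated on gold-tokenized test data with:
udpipe --accuracy --tag --parse udpipe_model test.conllu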