Parsito User's Manual
In a natural language text, the task of dependency parsing is to assign to each word in a sentence its dependency head and the dependency relation to that head.
Parsito is a transition-based parser, which greedily chooses transitions from the initial state (all words in a sentence unlinked) to the final state (full dependency tree). It uses an artificial neural network classifier in every state to choose the next transition to perform. Further details are described in Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle.
Like any supervised machine learning tool, Parsito needs a trained linguistic model. This section describes the available language models as well as the command-line tools and interfaces.
1. Universal Dependencies 1.2 Models
Universal Dependencies 1.2 Models are distributed under the CC BY-NC-SA licence. The models are based solely on Universal Dependencies 1.2 treebanks. The models work in Parsito version 1.0.
Universal Dependencies 1.2 Models are versioned according to the date of release in the format YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively. The latest version is 151120.
1.1. Download
The latest version 151120 of the Universal Dependencies 1.2 models can be downloaded from the LINDAT/CLARIN repository.
1.2. Acknowledgements
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The models were trained on Universal Dependencies 1.2 treebanks.
1.2.1. Publications
- (Straka et al. 2015) Straka Milan, Hajič Jan, Straková Jana and Hajič Jan jr. Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle. In Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT 14), December 2015.
1.3. Model Description
The parsing models use the following CoNLL-U fields during parsing:
- form
- upostag
- feats
All other fields (notably lemma and xpostag) are currently ignored.
Some language models produce non-projective trees and others projective trees, depending on which transition system performed better on the development data.
2. Running the Parser
To run the parser with an existing parser model, use:
run_parsito parser_model
The input is assumed to be in UTF-8 encoding and by default in CoNLL-U format.
Any number of files can be specified after the parser_model argument. If an argument input_file:output_file is used, the given input_file is processed and the result is saved to output_file. If only input_file is used, the result is printed to standard output. If no argument is given, input is read from standard input and written to standard output.
The full command syntax of run_parsito is:
run_parsito [options] model_file [file[:output_file]]...
Options: --input=conllu
         --output=conllu
         --beam_size=beam size during decoding
         --version
         --help
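For example, the following command (the model file name is illustrative) parses input.conllu into parsed.conllu and writes the parse of other.conllu to standard output:
run_parsito english-ud-1.2-151120.parsito input.conllu:parsed.conllu other.conllu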
2.1. Input Format
The input format is specified using the --input option. Currently supported input formats are:
- conllu (default): the CoNLL-U format
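For illustration, parser input could look as follows (a hand-made example; the ten tab-separated columns are ID, FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, HEAD, DEPREL, DEPS and MISC, with HEAD and DEPREL left for the parser to fill in):
1	The	the	DET	_	Definite=Def|PronType=Art	_	_	_	_
2	cat	cat	NOUN	_	Number=Sing	_	_	_	_
3	sleeps	sleep	VERB	_	Number=Sing|Person=3|Tense=Pres	_	_	_	_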
2.2. Output Format
The output format is specified using the --output option. Currently supported output formats are:
- conllu (default): the CoNLL-U format
2.3. Beam Search
Optionally, beam search can be used to improve parsing accuracy, at the expense of parsing speed. When using beam search of size b, parsing is roughly 1.2 * b times slower, but the accuracy usually increases.
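For example, the following invocation (file names are illustrative) parses with a beam of size 5, which is roughly 6 times slower than greedy parsing:
run_parsito --beam_size=5 english-ud-1.2-151120.parsito input.conllu:parsed.conllu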
3. Running the Parsito REST Server
Parsito also provides a REST server binary parsito_server. The binary uses MicroRestD as a REST server implementation and provides the Parsito REST API.
The full command syntax of parsito_server is:
parsito_server [options] port (model_name model_file acknowledgements beam_size)+
Options: --daemon
         --version
         --help
The parsito_server can run either in the foreground or in the background (when --daemon is used). The specified model files are loaded during start and are kept in memory the whole time. This behaviour might change in the future so that models are loaded on demand.
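As an illustration, the following command (all names are examples only) starts a daemonized server on port 8001 serving a single model under the name en, with the given acknowledgements argument and a beam size of 5:
parsito_server --daemon 8001 en english-ud-1.2-151120.parsito english_acknowledgements 5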
4. Training Custom Parser Models
Training of Parsito models can be performed using the train_parsito binary. The first argument to train_parsito is the parsing algorithm identifier; currently the only available algorithm is nn.
4.1. The parsing algorithm nn
The full command syntax of train_parsito nn is:
train_parsito nn [options] <training_data >parser_model
Options: --adadelta=momentum,epsilon
         --adagrad=learning rate,epsilon
         --adam=learning rate[,beta1,beta2,final learning rate]
         --batch_size=batch size
         --dropout_hidden=hidden layer dropout
         --dropout_input=input dropout
         --early_stopping=[0|1] use early stopping
         --embeddings=embedding description file
         --heldout=heldout data file
         --hidden_layer=hidden layer size
         --hidden_layer_type=cubic|tanh (hidden layer activation function)
         --initialization_range=initialization range
         --input=conllu (input format)
         --iterations=number of training iterations
         --l1_regularization=l1 regularization factor
         --l2_regularization=l2 regularization factor
         --maxnorm_regularization=max-norm regularization factor
         --nodes=node selector file
         --structured_interval=structured prediction interval
         --sgd=learning rate[,final learning rate]
         --sgd_momentum=momentum,learning rate[,final learning rate]
         --single_root=[0|1] allow only single root
         --threads=number of training threads
         --transition_oracle=static|static_eager|static_lazy|dynamic
         --transition_system=projective|swap|link2
         --version
         --help
The required options of train_parsito nn are the following (reasonable defaults are suggested in parentheses):
- iterations: number of training iterations to use (10)
- hidden_layer: size of the hidden layer (200)
- embeddings: file containing the embedding description
- nodes: file containing the nodes description
- sgd, sgd_momentum, adadelta, adagrad, adam: which neural network training algorithm to use (sgd=0.02,0.001)
  - sgd=learning rate[,final learning rate]: use SGD with the specified learning rate, using exponential decay
  - sgd_momentum=momentum,learning rate[,final learning rate]: use SGD with momentum and the specified learning rate, using exponential decay
  - adadelta=momentum,epsilon: use AdaDelta with the specified parameters
  - adagrad=learning rate,epsilon: use AdaGrad with the specified parameters
  - adam=learning rate[,beta1,beta2,final learning rate]: use Adam with the specified parameters, optionally using exponential decay
- transition_system: which transition system to use for parsing (language dependent; you can try all of them and choose the best one)
  - projective: projective stack-based arc-standard system with shift, left_arc and right_arc transitions
  - swap: fully non-projective system which extends the projective system by adding a swap transition
  - link2: partially non-projective system which extends the projective system by adding left_arc2 and right_arc2 transitions
- transition_oracle: which transition oracle to use for the chosen transition_system:
  - transition_system=projective: available oracles are static and dynamic (dynamic usually gives better results, but training is slower)
  - transition_system=swap: available oracles are static_eager and static_lazy (static_lazy almost always gives better results)
  - transition_system=link2: the only available oracle is static
The additional options of train_parsito nn are the following (again with suggested values in parentheses); a complete training example is shown after the list:
- batch_size (default 1): use batches of the specified size (10)
- dropout_hidden (default 0): probability of dropout of a hidden layer node
- dropout_input (default 0): probability of dropout of an input layer node
- early_stopping (default 1 if heldout data is given, 0 otherwise): use early stopping depending on heldout LAS accuracy
- heldout: use the specified file as heldout data and report the results of the trained model on it
- hidden_layer_type (default tanh): hidden layer activation function, one of tanh, cubic, relu
- initialization_range (default 0.1): maximum absolute value of the initial random weights in the network; if a negative value is used, the maximum absolute value of the initial random weights is -initialization_range * sqrt(6.0 / (n+m))
- input (default conllu): input format to use
- l1_regularization (default 0): L1 regularization factor
- l2_regularization (default 0): L2 regularization factor (0.3)
- maxnorm_regularization (default 0): if the L2 norm of a row in the network is larger than the specified maximum, the row vector is scaled so that its norm is exactly the specified maximum
- single_root (default 0): allow only a single root when parsing, and make sure only the root node has the root deprel (note that the training data are checked to be in this format)
- structured_interval (default 0): use a search-based oracle in addition to the specified transition_oracle. This almost always gives better results, but makes training 2-3 times slower. For details, see the paper Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle (use 10 if you want high accuracy and do not mind slower training)
- threads (default 1): if more than 1, train using asynchronous SGD/AdaDelta/AdaGrad with the specified number of threads. Note that asynchronous training is nondeterministic and may give worse results than synchronous training
4.1.1. Input Format
The input format is specified using the --input
option. Currently supported
input formats are:
conllu
(default): the CoNLL-U format
4.1.2. Embedding description
The embeddings used for every word are specified in the embedding description file. Each line in the file describes one embedding in the following format:
embedding_source dimension minimum_frequency [precomputed_embeddings [update_weights [maximum_precomputed_embeddings]]]
- embedding_source: for what data the embedding is created:
  - form: word form
  - lemma: word lemma
  - universal_tag: universal POS tag of the word (the upostag field of the input CoNLL-U)
  - tag: language-specific POS tag of the word (the xpostag field of the input CoNLL-U)
  - feats: morphological features of the word (the feats field of the input CoNLL-U)
  - universal_tag_fields: concatenation of universal_tag and feats
  - deprel: the already assigned dependency relation of the word, if any
- dimension: dimension of the embedding
- minimum_frequency: only create embeddings for values with the specified minimum frequency. If the minimum frequency is greater than 1, an embedding for an artificial OOV value is created and used for unknown values
- precomputed_embeddings (default none): use precomputed embeddings (generated for example by word2vec) from the specified file. The precomputed embeddings file format is the one which word2vec uses
- update_weights (default 1): whether the weights of precomputed embeddings should be updated further during training:
  - 0: no, keep the original precomputed embeddings
  - 1: yes, update the precomputed embeddings
  - 2: yes, update the precomputed embeddings, and keep only the embeddings for words found in the training data (contrary to 0 and 1)
- maximum_precomputed_embeddings (default infinity): use at most this many precomputed embeddings (the ones at the beginning of the file are used, which is fine, because embeddings are usually sorted from the most frequent value)
When precomputed embeddings are given, their casing is preserved. At inference time, several variants of a given word are tried when looking up an embedding, stopping at the first one found:
- original word
- if the first letter and any other letter of the word are uppercase (or titlecase), all letters but the first are lowercased
- all letters lowercased
- if the word begins with a digit and it does not contain any letters, the first digit alone is used
- otherwise, the embedding for an unknown word is used
If unsure what embedding description to use, you can use embeddings from Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle (in the paper, embeddings for forms were precomputed using word2vec on the training data):
universal_tag 20 1
feats 20 1
form 50 2 [precomputed_embeddings_if_any]
deprel 20 1
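If you want to precompute form embeddings with word2vec as in the paper, an invocation along the following lines can be used (a sketch only; train_forms.txt is assumed to contain one sentence of space-separated forms per line, and the hyperparameters other than the dimension and minimum count are illustrative):
word2vec -train train_forms.txt -output forms.vec -size 50 -min-count 2 -iter 15
The dimension (-size 50) and minimum count (-min-count 2) match the form line of the description above.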
4.1.3. Nodes description
Only some nodes are considered by the classifier in every parser state. Such nodes are specified in the nodes description file, one node per line, in the following format:
location index[,direction,...]
The location can be one of:
- stack: use the stack of already processed nodes, with index 0 representing the node on top of the stack
- buffer: use the buffer of not yet processed nodes, with index 0 representing the first node in the buffer
Using location and index, a node is found. Optionally, its parent or one of its children can be chosen by specifying one or more additional directions in the following format:
- parent: choose the parent of the current node
- child,index: choose a child of the current node, with the first children being indexed 0, 1, 2, ... and the last children being indexed -3, -2, -1
If unsure, you can use the frequently used set of 18 nodes (used for example by Zhang and Nivre 2011: Transition-based dependency parsing with rich non-local features, by Chen and Manning 2014: A fast and accurate dependency parser using neural networks, or by Straka et al. 2015: Parsing Universal Dependency Treebanks using Neural Networks and Search-Based Oracle):
stack 0
stack 1
stack 2
buffer 0
buffer 1
buffer 2
stack 0,child 0
stack 0,child 1
stack 0,child -2
stack 0,child -1
stack 1,child 0
stack 1,child 1
stack 1,child -2
stack 1,child -1
stack 0,child 0,child 0
stack 0,child -1,child -1
stack 1,child 0,child 0
stack 1,child -1,child -1
4.2. Measuring Parser Accuracy
The accuracy of a custom parser model can be measured by running:
parsito_accuracy parser_model <test_data
This binary reads input in the CoNLL-U format containing gold dependency trees (typically manually annotated) and evaluates the accuracy of the parser model on the given testing data.
Optionally, beam search can be used to improve parsing accuracy, at the expense of parsing speed. When using beam search of size b, parsing is roughly 1.2 * b times slower, but the accuracy usually increases.
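For instance, assuming parsito_accuracy accepts the same --beam_size option as run_parsito, the accuracy with a beam of size 5 could be measured as follows (file names are illustrative):
parsito_accuracy --beam_size=5 model.parsito <test.conllu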