NameTag 3 Formats
This page describes the NameTag 3 REST Web Service input and output formats.
1. Input Formats
The input format is specified using the input parameter. Currently supported input formats are:
untokenized (default): the input will be tokenized and segmented using a tokenizer defined by the model.
vertical: the input is in vertical format; every line is considered a word, and an empty line denotes the end of a sentence.
conllu-ne: the input is in the CoNLL-U format.
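The vertical input format described above can be produced from already tokenized text with a few lines of Python. This is only a minimal sketch: the function name to_vertical and the sample sentences are hypothetical and not part of NameTag.

```python
def to_vertical(sentences):
    """Serialize tokenized sentences: one word per line, empty line after each sentence."""
    lines = []
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # an empty line denotes the end of the sentence
    return "".join(line + "\n" for line in lines)


# Hypothetical pre-tokenized input: a list of sentences, each a list of words.
sentences = [
    ["Václav", "Havel", "byl", "dramatik", "."],
    ["Jmenuji", "se", "Jan", "Novák", "."],
]
vertical_text = to_vertical(sentences)
print(vertical_text)
```

The resulting string could then be sent as the request data together with input set to vertical.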
2. Output Formats
The output format is specified using the output parameter. Currently supported output formats are:
xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token> element to mark tokens. The recognized named entities are encoded using the <ne type="..."> element.
Example input:
Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
NameTag identifies a first name (pf), a surname (ps) and a person name container (P) in the input (line breaks added):
<sentence><ne type="P"><ne type="pf"><token>Václav</token></ne> <ne type="ps"><token>Havel</token></ne></ne>
<token>byl</token> <token>český</token> <token>dramatik</token><token>,</token> <token>esejista</token><token>,</token>
<token>kritik</token> <token>komunistického</token> <token>režimu</token> <token>a</token> <token>později</token>
<token>politik</token><token>.</token></sentence>
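Because the xml output has no root element, a standard XML parser needs the output wrapped in one before it can read it. The following Python sketch does that with the standard library; the <doc> wrapper and the function name extract_entities are assumptions made for illustration, and the sample is a shortened version of the example output.

```python
import xml.etree.ElementTree as ET


def extract_entities(xml_output):
    """Return (type, text) pairs for every <ne> element in document order."""
    # The output has no root element, so wrap it in a hypothetical <doc> root.
    root = ET.fromstring("<doc>" + xml_output + "</doc>")
    entities = []
    for ne in root.iter("ne"):
        # Join all tokens inside the entity, including tokens of nested entities.
        # Tokens are joined with single spaces; original spacing is not restored.
        text = " ".join(token.text or "" for token in ne.iter("token"))
        entities.append((ne.get("type"), text))
    return entities


# Shortened version of the example output (assumed sample data).
xml_output = (
    '<sentence><ne type="P"><ne type="pf"><token>Václav</token></ne> '
    '<ne type="ps"><token>Havel</token></ne></ne> <token>byl</token> '
    '<token>český</token> <token>dramatik</token><token>.</token></sentence>'
)
entities = extract_entities(xml_output)
print(entities)
```

For this sample, the container P is listed first, followed by the nested pf and ps entities.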
vertical: Every recognized named entity is on a separate line. Each line contains three tab-separated fields: entity_range, entity_type and entity_text. The entity_range is composed of the token identifiers of the tokens forming the named entity (counting from 1, with every end of sentence counted as well; if the input is also vertical, token identifiers correspond exactly to line numbers), and entity_type represents its type. The entity_text is not strictly necessary; it contains the space-separated words of the named entity.
Example input:
Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
Example output:
1,2	P	Václav Havel
1	pf	Václav
2	ps	Havel
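The vertical output is straightforward to read back into a program. The Python sketch below splits each line into its three tab-separated fields; it assumes, based on the example above, that the token identifiers in entity_range are separated by commas. The function name parse_vertical_output is hypothetical.

```python
def parse_vertical_output(text):
    """Parse lines of entity_range<TAB>entity_type<TAB>entity_text."""
    entities = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields so entity_text is kept intact.
        entity_range, entity_type, entity_text = line.split("\t", 2)
        # Assumption: token identifiers are comma-separated, as in "1,2".
        token_ids = [int(token_id) for token_id in entity_range.split(",")]
        entities.append((token_ids, entity_type, entity_text))
    return entities


vertical_output = "1,2\tP\tVáclav Havel\n1\tpf\tVáclav\n2\tps\tHavel\n"
parsed = parse_vertical_output(vertical_output)
for token_ids, entity_type, entity_text in parsed:
    print(token_ids, entity_type, entity_text)
```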
conll: A CoNLL-like vertical format. Every word is on a separate line, followed by a tab and the recognized entity label. An empty line denotes the end of a sentence. The entity labels are:
O: no entity
B-type: the word is the first word of an entity of type type
I-type: the word is a non-initial word of an entity of type type
Example input:
Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
Example output:
Václav	B-P
Havel	I-P
byl	O
český	O
...
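The B-type and I-type labels of the conll output can be folded back into entity spans. The Python sketch below assumes exactly one label per word, as in the example above; the function name bio_to_spans is hypothetical, and an I-type label that does not continue an open entity of the same type is simply ignored.

```python
def bio_to_spans(conll_text):
    """Collect (type, text) spans from word<TAB>label lines."""
    spans = []
    current = None  # (entity type, list of words) of the entity being built
    # The extra empty line flushes an entity that ends the last sentence.
    for line in conll_text.splitlines() + [""]:
        if not line.strip():
            if current:  # an empty line ends the sentence and any open entity
                spans.append(current)
            current = None
            continue
        word, label = line.split("\t")
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [word])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(word)
        else:  # "O" or an I- label that continues nothing
            if current:
                spans.append(current)
            current = None
    return [(entity_type, " ".join(words)) for entity_type, words in spans]


conll_output = "Václav\tB-P\nHavel\tI-P\nbyl\tO\nčeský\tO\n\n"
spans = bio_to_spans(conll_output)
print(spans)
```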
conllu-ne: the output is in the CoNLL-U format, with named entities in the MISC column. All labels corresponding to a token form one item in the MISC column, delimited from the other annotations by a vertical bar |. The item key is NE=. If there are multiple labels, they are delimited by a hyphen -. Every named entity mention receives a unique numeric identifier, appended to the label with an underscore _.
Example input:
1	Jmenuji	jmenovat	VERB	VB-S---1P-AA--1	Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	TokenRange=0:7
2	se	se	PRON	P7-X4----------	Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short	1	expl:pv	_	TokenRange=8:10
3	Jan	Jan	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Giv|Number=Sing|Polarity=Pos	1	nsubj	_	TokenRange=11:14
4	Novák	Novák	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos	3	flat	_	SpaceAfter=No|TokenRange=15:20
5	.	.	PUNCT	Z:-------------	_	1	punct	_	SpacesAfter=\n|TokenRange=20:21
Example output:
1	Jmenuji	jmenovat	VERB	VB-S---1P-AA--1	Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	TokenRange=0:7
2	se	se	PRON	P7-X4----------	Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short	1	expl:pv	_	TokenRange=8:10
3	Jan	Jan	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Giv|Number=Sing|Polarity=Pos	1	nsubj	_	TokenRange=11:14|NE=P_1-pf_2
4	Novák	Novák	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos	3	flat	_	SpaceAfter=No|TokenRange=15:20|NE=P_1-ps_3
5	.	.	PUNCT	Z:-------------	_	1	punct	_	SpacesAfter=\n|TokenRange=20:21
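To read the named entities back from conllu-ne output, the NE= item has to be located in the MISC column and split into labels and mention identifiers. The Python sketch below follows the rules stated above; it assumes that labels themselves contain no hyphen, and the function names parse_ne_misc and entities_per_token are hypothetical. The sample rows abbreviate the XPOS and FEATS columns to _ for brevity.

```python
def parse_ne_misc(misc):
    """Return (label, mention_id) pairs from the NE= item of a MISC value."""
    for item in misc.split("|"):
        if item.startswith("NE="):
            pairs = []
            # Multiple labels are delimited by a hyphen, e.g. "P_1-pf_2".
            for labelled in item[len("NE="):].split("-"):
                # The mention identifier follows the last underscore.
                label, _, mention_id = labelled.rpartition("_")
                pairs.append((label, int(mention_id)))
            return pairs
    return []  # no NE= item: the token is not part of any entity


def entities_per_token(conllu_text):
    """Return (form, labels) for every token line of a CoNLL-U text."""
    result = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        columns = line.split("\t")
        result.append((columns[1], parse_ne_misc(columns[9])))  # FORM, MISC
    return result


# Sample rows based on the example output, with XPOS and FEATS abbreviated.
rows = [
    ["3", "Jan", "Jan", "PROPN", "_", "_", "1", "nsubj", "_",
     "TokenRange=11:14|NE=P_1-pf_2"],
    ["4", "Novák", "Novák", "PROPN", "_", "_", "3", "flat", "_",
     "SpaceAfter=No|TokenRange=15:20|NE=P_1-ps_3"],
    ["5", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "TokenRange=20:21"],
]
conllu_output = "\n".join("\t".join(row) for row in rows) + "\n"
tokens = entities_per_token(conllu_output)
print(tokens)
```

Tokens sharing the same mention identifier (here P_1 on both Jan and Novák) belong to the same entity mention.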