NameTag 2 Models
In natural language text, the task of nested named entity recognition (NER) is to identify proper names, such as names of persons, organizations and locations, including entities embedded inside other entities. NameTag 2 identifies and classifies nested named entities with the algorithm described in Straková et al. (2019).
Like any supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the available language models.
All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository. The models work in NameTag version 2, and all of them use UDPipe for tokenization.
The models are versioned by release date, in the format YYMMDD, where YY, MM and DD are the two-digit year, month and day, respectively. The latest version is 210916.
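For orientation, the sketch below shows one way to query a released model through the LINDAT web service, selecting a model by its versioned name. The endpoint, parameter names and response shape are assumptions based on the public NameTag REST API; verify them against the official web service documentation.

```python
# A minimal sketch, assuming the public NameTag REST service at LINDAT
# and its /recognize endpoint; check the endpoint, parameters and
# response shape against the official API documentation.
import requests

API = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"

response = requests.post(API, data={
    "data": "Jan Novák navštívil Ústí nad Labem.",  # raw, untokenized text
    "model": "czech-cnec2.0-200831",  # versioned model name (YYMMDD suffix)
})
response.raise_for_status()
print(response.json()["result"])  # annotated text with <ne type="..."> elements
```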
1. Czech CNEC2.0 Model
The Czech model is trained on the training portion of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).
The corpus uses 46 atomic named entity types, which can be embedded; for example, a river name can be part of a city name, as in <gu Ústí nad <gh Labem>>. There are also 4 so-called NE containers, in which two or more NEs form a larger unit (e.g., a first name and a surname together form a person name NE container, as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
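In NameTag's XML output, nesting of this kind is rendered as nested ne elements. The following is an illustrative sketch, not verbatim service output, for a sentence containing both examples above:

```xml
<!-- Hypothetical Czech CNEC 2.0 output; element nesting mirrors NE nesting. -->
<ne type="P"><ne type="pf">Jan</ne> <ne type="ps">Novák</ne></ne> navštívil
<ne type="gu">Ústí nad <ne type="gh">Labem</ne></ne>.
```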
The latest version is 200831, distributed by LINDAT. The model czech-cnec2.0-200831 reaches 83.44 F1-measure for the fine-grained, two-character types and 87.04 for the coarse, one-character supertypes on the CNEC 2.0 test data.
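The coarse supertype of a fine-grained two-character type is simply its first character; below is a trivial sketch of that mapping, assuming the standard CNEC type codes (container tags such as P are already one letter):

```python
# Coarse CNEC supertype = first character of the fine-grained type,
# e.g. "gu" (city/town) -> "g" (geographical name).
def supertype(netype: str) -> str:
    return netype[0]

assert supertype("gu") == "g"
assert supertype("P") == "P"
```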
1.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.
The research was carried out by Jana Straková and Milan Straka.
1.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.
Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.
2. English CoNLL Model
The English model is trained on the training portion of the CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003) of a part of the Reuters Corpus.
The corpus uses four classes: PER, ORG, LOC and MISC.
The latest version is 200831, distributed by LINDAT. The model english-conll-200831 reaches 91.68 F1-measure on the CoNLL-2003 test data.
2.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
2.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
3. German CoNLL Model
The German model is trained on the training portion of the CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003) of a part of the Reuters Corpus.
The corpus uses four classes: PER, ORG, LOC and MISC.
The latest version is 200831, distributed by LINDAT. The model german-conll-200831 reaches 82.65 F1-measure on the CoNLL-2003 test data.
3.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
3.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
4. German GermEval Model
The German model is trained on the training portion of the GermEval 2014 NER Shared Task data (Benikova et al., 2014).
The corpus annotation uses nested entities (i.e., an entity can be embedded in another entity), with nesting limited to at most two levels (an outer entity and an inner entity). The annotation also accounts for derivatives and for tokens that contain a named entity only partially. It therefore uses the following labels: PER, ORG, LOC, OTH, and O for tokens that are not part of an entity; further PERderiv, ORGderiv, LOCderiv and OTHderiv for derivatives, and PERpart, ORGpart, LOCpart and OTHpart for partial entities (e.g., Troia-Ausstellung, in which only Troia is the named entity; example from Benikova et al., 2014).
The latest version is 210916, distributed by LINDAT. The model german-germeval-210916 reaches 84.40 F1-measure on the GermEval 2014 test data, measured with the official shared task evaluation script nereval.perl.
4.1. German GermEval State of the Art
4.1.1. Systems trained on GermEval 2014 training data
F1 (strict, official) | System
84.40 | NameTag 2 german-germeval-210916 model
79.10 | Modular Classifier (Hänig 2014)
78.42 | Semi-Supervised Features (Agerri 2017)
76.37 | (Riedl and Padó, 2020)
76.12 | Hybrid Neural Networks (Shao 2016)
4.1.2. Systems trained on additional data
F1 (strict, official) | System
84.73 | (Riedl and Padó, 2020), transfer from CoNLL data
4.2. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
4.2.1. Publications
The methodology comes from the following publication; the result itself was measured later (in 2021) and is as yet unpublished:
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
5. Dutch CoNLL Model
The Dutch model is trained on the training portion of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002).
The corpus uses four classes: PER, ORG, LOC and MISC.
The latest version is 200831, distributed by LINDAT. The model dutch-conll-200831 reaches 91.17 F1-measure on the CoNLL-2002 test data.
5.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
5.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
6. Spanish CoNLL Model
The Spanish model is trained on the training portion of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002).
The corpus uses four classes: PER, ORG, LOC and MISC.
The latest version is 200831, distributed by LINDAT. The model spanish-conll-200831 reaches 88.55 F1-measure on the CoNLL-2002 test data.
6.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
6.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
7. Ukrainian lang-uk Model
The Ukrainian model is trained on the Ukrainian lang-uk NER corpus from the lang-uk initiative. The corpus uses four classes: PER, ORG, LOC and MISC (please note that we renamed the original PERS to the more common PER). The corpus was split randomly into train/dev/test parts in an 8:1:1 ratio. The released model ukrainian-languk-230306 reaches 88.73 F1-measure on the test split.
The latest version is 230306, soon to be distributed by LINDAT.
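A minimal sketch of the PERS-to-PER renaming mentioned above; the helper name is hypothetical, since the actual conversion script used in training is not part of this documentation:

```python
# Hypothetical helper: normalize lang-uk labels to the tagset above,
# renaming the original PERS to the more common PER.
RENAME = {"PERS": "PER"}

def normalize_label(label: str) -> str:
    return RENAME.get(label, label)

assert normalize_label("PERS") == "PER"
assert normalize_label("ORG") == "ORG"
```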
7.1. Acknowledgements
The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781), by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2023062, LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the LUSyD project GX20-16819X.
The research was carried out by Jana Straková and Milan Straka.
7.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.