NameTag 3 Models
- Results at a Glance
- Czech CNEC 2.0 Model
- Multilingual Model
- 3.1. Arabic CoNLL-2012 OntoNotes v5
- 3.2. Chinese CoNLL-2012 OntoNotes v5
- 3.3. Chinese UNER GSDSIMP
- 3.4. Chinese UNER GSD
- 3.5. Croatian UNER SET
- 3.6. Czech CNEC 2.0 CoNLL (4 labels, flat)
- 3.7. Danish UNER DDT
- 3.8. Dutch CoNLL-2002
- 3.9. English CoNLL-2012 OntoNotes v5
- 3.10. English UNER EWT
- 3.11. English CoNLL-2003
- 3.12. German CoNLL-2003
- 3.13. Maghrebi Arabic French UNER Arabizi
- 3.14. Norwegian bokmaal UNER NDT
- 3.15. Norwegian nynorsk UNER NDT
- 3.16. Portuguese UNER Bosque
- 3.17. Serbian UNER SET
- 3.18. Slovak UNER SNK
- 3.19. Spanish CoNLL-2002
- 3.20. Swedish UNER Talbanken
- 3.21. Ukrainian Lang-uk
- 3.22. Acknowledgements
- Multilingual CoNLL Model
In natural language text, the task of (nested) named entity recognition (NER) is to identify proper names such as names of persons, organizations and locations.
As a supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the trained models available for NameTag 3.
All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository.
The models are versioned by release date. The version format is YYMMDD, where YY, MM and DD are the two-digit representations of the year, month and day, respectively.
The latest version is 240830 for the Czech CNEC 2.0 model and 250203 for the Multilingual model.
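The versioning scheme can be illustrated with a short Python sketch that reads the YYMMDD suffix of a model name and turns it into a calendar date. The helper name release_date is hypothetical and not part of NameTag, and the sketch assumes that two-digit years mean 20YY.

```python
import re
from datetime import date


def release_date(model_name: str) -> date:
    """Parse the trailing YYMMDD version of a model name into a date.

    Hypothetical helper for illustration; not part of NameTag itself.
    """
    match = re.search(r"-(\d{2})(\d{2})(\d{2})$", model_name)
    if match is None:
        raise ValueError(f"no YYMMDD suffix in {model_name!r}")
    yy, mm, dd = (int(part) for part in match.groups())
    # Assumption: two-digit years denote the years 2000-2099.
    return date(2000 + yy, mm, dd)


print(release_date("nametag3-czech-cnec2.0-240830"))  # 2024-08-30
print(release_date("nametag3-multilingual-250203"))   # 2025-02-03
```

Because the suffix is ordered year, month, day, comparing two version strings lexicographically also orders the releases chronologically within the same century.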
1. Results at a Glance
2. Czech CNEC 2.0 Model
The Czech CNEC 2.0 model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).
The corpus uses 46 atomic named entity types, which can be embedded; e.g., the river name Labe can be part of the name of a city, as in <gu Ústí nad <gh Labem>>.
In parallel, the corpus is also annotated with 7 coarser, one-character supertypes, also potentially nested. Furthermore, there are 4 so-called NE (named entity) containers: two or more NEs are parts of a NE container (e.g., two NEs, a first name and a surname, together form a person name NE container, as in <P <pf Jan><ps Novák>>).
The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
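To make the nesting concrete, the following Python sketch reads the bracketed notation used in the examples above and returns the plain text together with the character span of every entity, inner and outer. It is only an illustration of the notation as it appears in this section: the function name parse_entities is hypothetical, and the sketch assumes that every tag is followed by a single space and that the text itself contains no literal angle brackets.

```python
def parse_entities(markup: str):
    """Return (plain_text, entities) for markup such as '<gu Ústí nad <gh Labem>>'.

    entities is a list of (tag, start, end) character spans into plain_text,
    in the order in which the entities are closed (inner before outer).
    Hypothetical helper; the real CNEC file format may differ in detail.
    """
    text = []       # characters of the plain text collected so far
    stack = []      # (tag, start offset) of entities that are still open
    entities = []
    i = 0
    while i < len(markup):
        ch = markup[i]
        if ch == "<":
            # Assumption: the tag runs up to the first following space.
            j = markup.index(" ", i)
            stack.append((markup[i + 1:j], len(text)))
            i = j + 1
        elif ch == ">":
            if not stack:
                raise ValueError("unbalanced '>' in markup")
            tag, start = stack.pop()
            entities.append((tag, start, len(text)))
            i += 1
        else:
            text.append(ch)
            i += 1
    if stack:
        raise ValueError("unclosed entity in markup")
    return "".join(text), entities


plain, spans = parse_entities("<gu Ústí nad <gh Labem>>")
print(plain)   # Ústí nad Labem
print(spans)   # [('gh', 9, 14), ('gu', 0, 14)]
```

The stack mirrors the embedding: the river name gh is closed first and lies entirely inside the city name gu.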
The latest version is nametag3-czech-cnec2.0-240830, distributed by LINDAT.
The model nametag3-czech-cnec2.0-240830 reaches an F1-measure of 86.39 on the fine-grained, two-character types and 89.29 on the coarse, one-character supertypes of the CNEC 2.0 test data.
2.1. Acknowledgements
This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
2.1.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.
Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.
3. Multilingual Model
This section describes the multilingual model published with NameTag 3.1.
Since NameTag 3.1, NameTag can be trained with multiple named entity tagsets. At inference time, the trained model can be asked to recognize named entities using a specific tagset; if none is requested, a default tagset is used.
The latest version is nametag3-multilingual-250203, distributed by LINDAT. This model was trained on 21 datasets in 17 languages and can recognize the following tagsets:
- conll (default): The CoNLL-2003 shared task tagset: PER, ORG, LOC, and MISC. Used when calling nametag3.py for prediction with --tagsets=conll, or by requesting nametag3-multilingual-conll-250203 from the NameTag 3 webservice.
- uner: The Universal NER v1 tagset: PER, ORG, LOC. Used when calling nametag3.py with --tagsets=uner, or by requesting nametag3-multilingual-uner-250203 from the NameTag 3 webservice.
- onto: The OntoNotes v5 tagset: PERSON, NORP, FAC, ORG, GPE, etc. Used when calling nametag3.py with --tagsets=onto, or by requesting nametag3-multilingual-onto-250203 from the NameTag 3 webservice.
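The correspondence between a tagset, the command-line option and the webservice model name listed above can be written down as a small lookup. This is a sketch only: the helper names are hypothetical, the label list for onto is deliberately incomplete because the text names only its first few labels, and nothing here calls nametag3.py or the webservice.

```python
# Tagsets as listed above. The "onto" entry is incomplete on purpose:
# the text names only these labels followed by "etc.".
TAGSETS = {
    "conll": ["PER", "ORG", "LOC", "MISC"],
    "uner": ["PER", "ORG", "LOC"],
    "onto": ["PERSON", "NORP", "FAC", "ORG", "GPE"],
}
DEFAULT_TAGSET = "conll"
VERSION = "250203"


def webservice_model(tagset: str = DEFAULT_TAGSET) -> str:
    """Name of the model to request from the webservice for a tagset."""
    if tagset not in TAGSETS:
        raise ValueError(f"unknown tagset {tagset!r}")
    return f"nametag3-multilingual-{tagset}-{VERSION}"


def tagsets_option(tagset: str = DEFAULT_TAGSET) -> str:
    """Command-line option selecting the tagset for nametag3.py."""
    if tagset not in TAGSETS:
        raise ValueError(f"unknown tagset {tagset!r}")
    return f"--tagsets={tagset}"


for name in TAGSETS:
    print(name, tagsets_option(name), webservice_model(name))
```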
3.1. Arabic CoNLL-2012 OntoNotes v5
The Arabic training corpus is the training part of the OntoNotes v5 Arabic corpus with the CoNLL-2012 train/dev/test split.
The model nametag3-multilingual-250203 with --tagsets=onto reaches 74.20 span-based micro F1 on the CoNLL-2012 OntoNotes v5 test data.
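For readers unfamiliar with the metric reported throughout this section, here is a minimal Python sketch of span-based micro F1 under its usual definition: a predicted entity counts as correct only if its boundaries and its label both match a gold entity exactly, and the counts are pooled over all entity types before precision and recall are computed. This illustrates the metric only; it is not the evaluation script used to obtain the reported numbers, and the tuple layout (sentence id, start, end, label) is an assumption.

```python
def span_micro_f1(gold, predicted) -> float:
    """Span-based micro F1 over (sentence_id, start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    # A prediction is a true positive only on an exact span and label match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "ORG")]
pred = [(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 4, "ORG")]
# Two of three predictions match exactly; the middle one has a wrong label.
print(f"{100 * span_micro_f1(gold, pred):.2f}")  # 66.67
```

The scores in this section are given on the same 0 to 100 scale as the printed value.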
3.2. Chinese CoNLL-2012 OntoNotes v5
One of the Chinese training corpora is the training part of the OntoNotes v5 Chinese corpus with the CoNLL-2012 train/dev/test split.
The model nametag3-multilingual-250203 with --tagsets=onto reaches 81.63 span-based micro F1 on the CoNLL-2012 OntoNotes v5 test data.
3.3. Chinese UNER GSDSIMP
One of the Chinese training corpora is the training part of the Universal NER v1 Chinese GSDSIMP corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 90.99 span-based micro F1 on the UNER test data.
3.4. Chinese UNER GSD
One of the Chinese training corpora is the training part of the Universal NER v1 Chinese GSD corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 91.53 span-based micro F1 on the UNER test data.
3.5. Croatian UNER SET
The Croatian training corpus is the training part of the Universal NER v1 Croatian SET corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 95.55 span-based micro F1 on the UNER test data.
3.6. Czech CNEC 2.0 CoNLL (4 labels, flat)
In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original CNEC 2.0 annotation has been harmonized to the standard 4-label tagset with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and a flattening of the original nested entities.
The script for the automated conversion to the 4-label CoNLL-2003 tagset can be found in the NameTag 3 GitHub repository.
If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
The model nametag3-multilingual-250203 with --tagsets=conll reaches 86.24 span-based micro F1 on the simplified flat named entity tags PER, ORG, LOC, and MISC.
3.7. Danish UNER DDT
The Danish training corpus is the training part of the Universal NER v1 Danish DDT corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 89.75 span-based micro F1 on the UNER test data.
3.8. Dutch CoNLL-2002
The Dutch training corpus is the training part of the CoNLL-2002 NE annotations (Tjong Kim Sang, 2002).
The model nametag3-multilingual-250203 with --tagsets=conll reaches 94.93 span-based micro F1 on the CoNLL-2002 test data.
3.9. English CoNLL-2012 OntoNotes v5
One of the English training corpora is the training part of the OntoNotes v5 English corpus with the CoNLL-2012 train/dev/test split.
The model nametag3-multilingual-250203 with --tagsets=onto reaches 90.19 span-based micro F1 on the CoNLL-2012 OntoNotes v5 test data.
3.10. English UNER EWT
One of the English training corpora is the training part of the Universal NER v1 English EWT corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 87.03 span-based micro F1 on the UNER test data.
3.11. English CoNLL-2003
One of the English training corpora is the training part of the CoNLL-2003 NE annotations (Sang and De Meulder, 2003) of part of the Reuters Corpus.
The model nametag3-multilingual-250203 with --tagsets=conll reaches 94.03 span-based micro F1 on the CoNLL-2003 test data.
3.12. German CoNLL-2003
The German training corpus is the training part of the German CoNLL-2003 NE annotations (Sang and De Meulder, 2003).
The model nametag3-multilingual-250203 with --tagsets=conll reaches 87.48 span-based micro F1 on the CoNLL-2003 test data.
3.13. Maghrebi Arabic French UNER Arabizi
The Maghrebi Arabic French training corpus is the training part of the Universal NER v1 Maghrebi Arabic French Arabizi corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 84.49 span-based micro F1 on the UNER test data.
3.14. Norwegian bokmaal UNER NDT
The Norwegian Bokmål training corpus is the training part of the Universal NER v1 Norwegian Bokmål NDT corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 95.83 span-based micro F1 on the UNER test data.
3.15. Norwegian nynorsk UNER NDT
The Norwegian Nynorsk training corpus is the training part of the Universal NER v1 Norwegian Nynorsk NDT corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 94.51 span-based micro F1 on the UNER test data.
3.16. Portuguese UNER Bosque
The Portuguese training corpus is the training part of the Universal NER v1 Portuguese Bosque corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 90.89 span-based micro F1 on the UNER test data.
3.17. Serbian UNER SET
The Serbian training corpus is the training part of the Universal NER v1 Serbian SET corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 97.10 span-based micro F1 on the UNER test data.
3.18. Slovak UNER SNK
The Slovak training corpus is the training part of the Universal NER v1 Slovak SNK corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 88.46 span-based micro F1 on the UNER test data.
3.19. Spanish CoNLL-2002
The Spanish training corpus is the training part of the CoNLL-2002 NE annotations (Tjong Kim Sang, 2002).
The model nametag3-multilingual-250203 with --tagsets=conll reaches 90.29 span-based micro F1 on the CoNLL-2002 test data.
3.20. Swedish UNER Talbanken
The Swedish training corpus is the training part of the Universal NER v1 Swedish Talbanken corpus.
The model nametag3-multilingual-250203 with --tagsets=uner reaches 91.79 span-based micro F1 on the UNER test data.
3.21. Ukrainian Lang-uk
The Ukrainian training data come from the Ukrainian Lang-uk NER corpus, based on the Lang-uk initiative. The corpus uses four classes: PER, ORG, LOC, and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in the ratio 8:1:1.
The model nametag3-multilingual-250203 with --tagsets=conll reaches 92.88 span-based micro F1 on the test split.
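The two preprocessing steps mentioned for the Lang-uk corpus, renaming PERS to PER and a random 8:1:1 split, can be sketched in Python as follows. This is a hedged illustration rather than the procedure actually used: the function names, the fixed seed, the choice of splitting unit and the optional BIO prefix handling (e.g. B-PERS) are all assumptions that the text does not specify.

```python
import random


def harmonize(label: str) -> str:
    """Map the original PERS label to the common PER label.

    Assumption: labels may carry a BIO prefix such as 'B-' or 'I-'.
    """
    prefix, sep, name = label.rpartition("-")
    return f"{prefix}{sep}PER" if name == "PERS" else label


def split_8_1_1(items, seed: int = 42):
    """Randomly split items into train/dev/test in the ratio 8:1:1.

    Assumption: the seed and the splitting unit (sentence or document)
    are illustrative; the original split is not reproduced by this code.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))          # 80 10 10
print(harmonize("B-PERS"), harmonize("I-ORG"))  # B-PER I-ORG
```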
3.22. Acknowledgements
This work has been supported by the MŠMT OP JAK program, project No. CZ.02.01.01/00/22_008/0004605 and by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
3.22.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
4. Multilingual CoNLL Model
The NameTag 3 multilingual CoNLL model is trained on the training data of the following corpora:
- English CoNLL-2003,
- German CoNLL-2003,
- Dutch CoNLL-2002,
- Spanish CoNLL-2002,
- Ukrainian Lang-uk,
- Czech CNEC 2.0 CoNLL.
The multilingual model uses four classes: PER, ORG, LOC and MISC.
The latest version is nametag3-multilingual-conll-240830, distributed by LINDAT.
4.1. English CoNLL-2003
The NameTag 3 English model is trained and served within the NameTag 3 multilingual-conll model. It is trained on the training part of the CoNLL-2003 NER annotations (Sang and De Meulder, 2003) of part of the Reuters Corpus. The corpus uses four classes: PER, ORG, LOC and MISC.
The model nametag3-multilingual-conll-240830 reaches 93.85 F1-measure on the CoNLL-2003 test data.
4.2. German CoNLL-2003
The NameTag 3 German model is trained and served within the NameTag 3 multilingual-conll model.
It is trained on the training part of the German CoNLL-2003 NER annotations (Sang and De Meulder, 2003). The corpus uses four classes: PER, ORG, LOC and MISC.
The model nametag3-multilingual-conll-240830 reaches 87.07 F1-measure on the CoNLL-2003 test data.
4.3. Dutch CoNLL-2002
The NameTag 3 Dutch model is trained and served within the NameTag 3 multilingual-conll model.
It is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.
The model nametag3-multilingual-conll-240830 reaches 94.42 F1-measure on the CoNLL-2002 test data.
4.4. Spanish CoNLL-2002
The NameTag 3 Spanish model is trained and served within the NameTag 3 multilingual-conll model.
It is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.
The model nametag3-multilingual-conll-240830 reaches 89.90 F1-measure on the CoNLL-2002 test data.
4.5. Ukrainian Lang-uk
The NameTag 3 Ukrainian model is trained and served within the NameTag 3 multilingual-conll model.
It is trained on the Ukrainian Lang-uk NER corpus, based on the Lang-uk initiative. The corpus uses four classes: PER, ORG, LOC and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in the ratio 8:1:1.
The model nametag3-multilingual-languk_conll-240830 reaches 91.73 F1-measure on the test split.
4.6. Czech CNEC 2.0 CoNLL (4 labels, flat)
In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original CNEC 2.0 annotation has been harmonized to the standard 4-label CoNLL format with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and a flattening of the original nested entities.
The script for the automated conversion to the 4-label CoNLL-2003 format can be found in the NameTag 3 GitHub repository.
If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
The model nametag3-multilingual-conll-240830 reaches 86.35 F1-measure on the simplified flat named entities labeled with PER, ORG, LOC, and MISC.
4.7. Acknowledgements
This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
4.7.1. Publications
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.