NameTag 3 Models

The task of nested named entity recognition (NER) is to identify proper names in natural language text, such as names of persons, organizations and locations, including names embedded in other names. NameTag 3 identifies and classifies nested named entities using the algorithm described in Straková et al. (2019).

Like any supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the language models available for NameTag 3.

All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository.

The models work in NameTag version 3.

All models use UDPipe for tokenization.

The models are versioned by release date. The version format is YYMMDD, where YY, MM and DD are the two-digit year, month and day, respectively. The latest version is 240830.
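The version scheme can be illustrated with a short Python sketch that recovers the release date from a model name. It assumes that the version is always the part after the last hyphen, as in the model names listed in this document; the helper name parse_model_version is hypothetical and not part of NameTag.

```python
from datetime import date, datetime


def parse_model_version(model_name: str) -> date:
    """Return the release date encoded in the trailing YYMMDD of a model name."""
    # Assumption: the version is the text after the last hyphen.
    version = model_name.rsplit("-", 1)[-1]
    if len(version) != 6 or not version.isdigit():
        raise ValueError(f"no YYMMDD version found in {model_name!r}")
    # %y parses a two-digit year, %m the month, %d the day.
    return datetime.strptime(version, "%y%m%d").date()


released = parse_model_version("nametag3-czech-cnec2.0-240830")
print(released.isoformat())  # 2024-08-30
```

Because the format is fixed-width and starts with the year, comparing two version strings as plain text also orders them chronologically.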

1. Results at a Glance

Corpus                           NameTag 2 F1  NameTag 3 F1  NameTag 3 Model
CNEC 2.0 fine-grained (nested)   83.44         86.39         nametag3-czech-cnec2.0-240830
CNEC 2.0 coarse (nested)         87.04         89.29         nametag3-czech-cnec2.0-240830
English CoNLL-2003 (flat)        91.68         93.85         nametag3-multilingual-conll-240830
German CoNLL-2003 (flat)         82.65         87.07         nametag3-multilingual-conll-240830
Dutch CoNLL-2002 (flat)          91.17         94.42         nametag3-multilingual-conll-240830
Spanish CoNLL-2002 (flat)        88.55         89.90         nametag3-multilingual-conll-240830
Ukrainian Lang-uk (flat)         88.73         91.73         nametag3-multilingual-conll-240830
CNEC 2.0 CoNLL (4 labels, flat)  N/A           86.35         nametag3-multilingual-conll-240830

2. Czech CNEC 2.0 Model

The Czech CNEC 2.0 model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).

The corpus uses 46 atomic named entity types, which can be embedded, e.g., the river name Labe can be part of a name of a city as in <gu Ústí nad <gh Labem>>. In parallel, the corpus is also annotated with 7 coarser, one-character supertypes, also potentially nested. Furthermore, there are also 4 so-called NE (named entity) containers: two or more NEs are parts of a NE container (e.g., two NEs, a first name and a surname, form together a person name NE container such as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
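The inline bracket notation used in the examples above can be read with a small stack-based parser. The following Python sketch is only an illustration of how nesting works in that notation. It assumes whitespace-separated tokens and well-formed brackets; the function name parse_cnec and the (type, start, end) output format are hypothetical and do not describe the corpus distribution format or NameTag's own reader.

```python
import re

# An opening bracket with its type ("<gu "), a closing bracket, or a plain token.
PIECE_RE = re.compile(r"<([^\s<>]+)\s+|>|[^<>\s]+")


def parse_cnec(annotated: str):
    """Return (tokens, entities) for a string in CNEC-style inline notation.

    Each entity is (type, start, end), where tokens[start:end] is its text.
    """
    tokens = []
    open_entities = []  # stack of (type, index of the entity's first token)
    entities = []
    for match in PIECE_RE.finditer(annotated):
        piece = match.group(0)
        if match.group(1) is not None:
            # Opening bracket: remember where this entity starts.
            open_entities.append((match.group(1), len(tokens)))
        elif piece == ">":
            # Closing bracket: the innermost open entity ends here.
            if not open_entities:
                raise ValueError("closing bracket without an open entity")
            entity_type, start = open_entities.pop()
            entities.append((entity_type, start, len(tokens)))
        else:
            tokens.append(piece)
    if open_entities:
        raise ValueError("unclosed entity bracket")
    return tokens, entities


tokens, entities = parse_cnec("<gu Ústí nad <gh Labem>>")
for entity_type, start, end in entities:
    print(entity_type, " ".join(tokens[start:end]))
# gh Labem
# gu Ústí nad Labem
```

Inner entities are emitted before the entities that contain them, because an entity is recorded only when its closing bracket is reached.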

The latest version is nametag3-czech-cnec2.0-240830, distributed by LINDAT.

The model nametag3-czech-cnec2.0-240830 reaches an F1-measure of 86.39 for the fine-grained, two-character types and 89.29 for the coarse, one-character supertypes on the CNEC 2.0 test data.
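For readers unfamiliar with the metric, the following Python sketch shows how an entity-level F1-measure is commonly computed. It assumes exact-match scoring, where a predicted entity counts as correct only if its span and type both match a gold entity. This document does not specify the evaluation script behind the reported numbers, so the function span_f1 and the toy data are illustrative only.

```python
def span_f1(gold: set, predicted: set) -> float:
    """F1 (as a percentage) over exactly matching (start, end, type) entities."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)  # share of predictions that are right
    recall = correct / len(gold)          # share of gold entities that were found
    return 100 * 2 * precision * recall / (precision + recall)


# Toy example: the nested entity (2, 3) was found but given the wrong type.
gold = {(0, 3, "gu"), (2, 3, "gh"), (5, 6, "pf")}
predicted = {(0, 3, "gu"), (2, 3, "gs"), (5, 6, "pf")}
print(round(span_f1(gold, predicted), 2))  # 66.67
```

Nested entities need no special handling under this assumption, since each entity is scored as an independent (start, end, type) triple.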

2.1. Acknowledgements

This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).

The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.

The research was carried out by Jana Straková and Milan Straka.

2.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.

Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.

Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.

3. Multilingual CoNLL Model

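The NameTag 3 multilingual model is trained on the training data of the following corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and the Czech CNEC 2.0 harmonized to the CoNLL labels. Each corpus is described in its own subsection below.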

The multilingual model uses four classes: PER, ORG, LOC and MISC.

The latest version is nametag3-multilingual-conll-240830, distributed by LINDAT.

3.1. English CoNLL

The NameTag 3 English model is trained and served within the NameTag 3 multilingual-conll model. The English part is trained on the training part of the CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003) of part of the Reuters Corpus. The corpus uses four classes: PER, ORG, LOC and MISC.

The model nametag3-multilingual-conll-240830 reaches 93.85 F1-measure on the CoNLL-2003 test data.

3.2. German CoNLL

The NameTag 3 German model is trained and served within the NameTag 3 multilingual-conll model. The German part is trained on the training part of the CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003) of German newspaper text from the Frankfurter Rundschau. The corpus uses four classes: PER, ORG, LOC and MISC.

The model nametag3-multilingual-conll-240830 reaches 87.07 F1-measure on the CoNLL-2003 test data.

3.3. Dutch CoNLL

The NameTag 3 Dutch model is trained and served within the NameTag 3 multilingual-conll model. The Dutch part is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The model nametag3-multilingual-conll-240830 reaches 94.42 F1-measure on the CoNLL-2002 test data.

3.4. Spanish CoNLL

The NameTag 3 Spanish model is trained and served within the NameTag 3 multilingual-conll model. The Spanish part is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The model nametag3-multilingual-conll-240830 reaches 89.90 F1-measure on the CoNLL-2002 test data.

3.5. Ukrainian Lang-uk

The NameTag 3 Ukrainian model is trained and served within the NameTag 3 multilingual-conll model.

The Ukrainian part is trained on the Ukrainian Lang-uk NER corpus, based on the Lang-uk initiative. The corpus uses four classes: PER, ORG, LOC and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in the ratio 8:1:1.
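The two preprocessing steps mentioned above, renaming PERS to PER and a random 8:1:1 split, can be sketched in Python as follows. This is a minimal illustration, not the script that was actually used: the BIO-style label prefixes, the unit being shuffled, the random seed and both function names are assumptions.

```python
import random


def harmonize_label(label: str) -> str:
    """Rename the PERS entity type to PER, keeping any prefix such as B- or I-."""
    prefix, separator, entity_type = label.rpartition("-")
    return f"{prefix}{separator}PER" if entity_type == "PERS" else label


def split_8_1_1(items, seed=42):
    """Shuffle items and split them into train/dev/test in the ratio 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed (assumed) for reproducibility
    n_dev = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_dev - n_test  # any remainder goes to train
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


print([harmonize_label(l) for l in ["B-PERS", "I-PERS", "O", "B-ORG"]])
# ['B-PER', 'I-PER', 'O', 'B-ORG']

train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

Whether the original split was made over sentences or whole documents is not stated here; the sketch works for either, since it only shuffles a list of items.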

The model nametag3-multilingual-conll-240830 reaches 91.73 F1-measure on the test split.

3.6. Czech CNEC 2.0 CoNLL (4 labels, flat)

In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, its original annotation has been harmonized to the standard 4-label CoNLL format (PER, ORG, LOC, MISC). This extensively simplifies the original annotation and flattens the original nested entities.

The script for the automated conversion to the 4-label CoNLL-2003 format can be found in the NameTag 3 GitHub repository.

If you are interested in the original Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) model with the complete set of 46 labels and nested entities, see the Czech CNEC 2.0 Model section.

The model nametag3-multilingual-conll-240830 reaches 86.35 F1-measure on the simplified flat named entities labeled with PER, ORG, LOC, and MISC.

3.7. Acknowledgements

This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).

The research was carried out by Jana Straková and Milan Straka.

3.7.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.