NameTag 3 Training Tutorial

1. Introduction

This tutorial demonstrates how to train your own NameTag 3 model for named entity recognition (NER) using custom datasets. We'll walk through several common training scenarios to help you get started.

If you have your own manually annotated dataset for named entity recognition, this tutorial will guide you through the process of training a NameTag 3 model tailored to your data. We'll explore different use cases and strategies.

All scripts used in this tutorial can be found in the NameTag 3 GitHub repository.

2. Cloning and Installing NameTag 3

Start by cloning the NameTag 3 repository and setting it up by following the installation instructions provided in the NameTag 3 README.

If you're planning to train NameTag 3 from scratch on your own dataset and do not intend to use any of the pre-trained models, you can safely skip step 5 (downloading and unzipping the pre-trained models) in the installation process.

3. Example Dataset

For demonstration purposes, we'll use the Czech Historical Named Entity Corpus (CHNEC). This dataset presents an interesting challenge for several reasons:

  1. There is currently no pre-trained NameTag 3 model for CHNEC.
  2. NameTag 3 does offer several Czech models that could serve as a foundation for transfer learning.
  3. CHNEC uses a unique annotation scheme that differs from existing models, but it can be adapted if needed.

You can download the corpus from http://chnec.kiv.zcu.cz. For more details about CHNEC, see Hubková et al. (2020).

The strict, span-based micro-average F1 reported for CHNEC in Hubková et al. (2020) is 73.00.

4. Data Preprocessing

The input format for NameTag 3 is the following:

The input data file format is a vertical file, with one token and its label(s) per line: labels are separated by |, and columns are separated by a tabulator (for example, the first and the fourth column of the well-known CoNLL-2003 shared task data); sentences are separated by empty lines. A line containing -DOCSTART- with the label O, as seen in the CoNLL-2003 shared task data, can be used to mark document boundaries. Input examples can be found in nametag3.py and in examples.

John	B-PER
loves	O
Mary	B-PER
.	O

Mary	B-PER
loves	O
John	B-PER
.	O
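
The example above shows flat annotations with a single label per token. For nested entities, a token carries one label per nesting level, separated by |. The snippet below is purely illustrative (the labels and the ordering of the nested labels are hypothetical; see the examples shipped with the repository for the exact convention used in the released data):

Charles	B-ORG|B-PER
University	I-ORG
in	O
Prague	B-LOC
.	O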

When preparing your data for NameTag 3, keep an eye out for these frequent issues:

  • Using spaces instead of tabs to separate columns. NameTag 3 expects columns to be separated by tab characters (\t), not spaces or other types of whitespace.
  • Not separating sentences by newlines. Each sentence should be followed by a blank line to indicate sentence boundaries.
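
A quick way to catch both issues before training is to run a small sanity check over each file. The following is a minimal, hypothetical sketch (not part of the NameTag 3 repository) that assumes one token and its tab-separated label(s) per line:

#!/usr/bin/env python3
"""Minimal sanity check for a NameTag 3 vertical file (illustrative sketch)."""
import sys

def check_vertical_file(path):
    sentences, tokens_in_sentence = 0, 0
    with open(path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, 1):
            line = raw.rstrip("\n")
            if not line.strip():
                # An empty line ends the current sentence.
                if tokens_in_sentence:
                    sentences += 1
                    tokens_in_sentence = 0
                continue
            tokens_in_sentence += 1
            if "\t" not in line:
                print(f"{path}:{lineno}: columns are not separated by a tab: {line!r}")
    if tokens_in_sentence:
        sentences += 1
    if sentences <= 1:
        print(f"{path}: only {sentences} sentence(s) found; are sentences separated by empty lines?")
    print(f"{path}: {sentences} sentences")

if __name__ == "__main__":
    check_vertical_file(sys.argv[1])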

Furthermore, it is advantageous to use the IOB2 encoding instead of the more permissive IOB encoding, as the IOB2 format is more consistent and therefore suitable for many sequence labeling models, including NameTag 3. For flat NEs, you can convert your data using the provided preprocessing/iob_to_iob2.py script. (In our preliminary experiments, converting CHNEC from the less consistent IOB to the more consistent IOB2 format improved performance by approximately 1 F1 point on Base-sized models and 2 F1 points on Large-sized models.)
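
To illustrate the difference: in IOB, an entity may start with an I- tag unless it immediately follows an entity of the same type, whereas in IOB2 every entity starts with a B- tag. The repository already ships preprocessing/iob_to_iob2.py for the actual conversion; the sketch below is only a hypothetical illustration of the core idea on a single sentence's tags:

def iob_to_iob2(tags):
    """Convert one sentence's IOB tags to IOB2, where every entity must begin with B-."""
    converted = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            previous = tags[i - 1] if i > 0 else "O"
            # In IOB2, I-X may only continue a preceding B-X or I-X of the same type.
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                tag = "B-" + entity_type
        converted.append(tag)
    return converted

# "New York" starts a fresh entity, so IOB's leading I-LOC becomes B-LOC.
print(iob_to_iob2(["I-PER", "I-PER", "O", "I-LOC", "I-LOC"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']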

To automatically convert the CHNEC training, development, and test files into the expected NameTag 3 format, you can use the following simple shell command:

for f in train dev test; do
  cut -f1,4 -d" " CHNEC_v0.1_$f.conll | tr " " "\t" | preprocessing/iob_to_iob2.py > $f.conll
done

This command does the following for each file (train, dev, and test):

  • Extracts the first and fourth columns, which correspond to the tokens and named entity labels.
  • Replaces the default space delimiter with a tab character.
  • Converts the IOB format to the IOB2 format using the preprocessing/iob_to_iob2.py script.
  • Outputs the cleaned version to a new file (e.g., train.conll, dev.conll, test.conll).

5. Training NameTag 3 From Scratch on Your Dataset

Our first experiment will involve training a NameTag 3 model from scratch using only our own dataset without leveraging any existing NameTag 3 pre-trained models or other Czech corpora.

NameTag 3 is trained via the script nametag3.py, which handles both training and evaluation. To train on a custom dataset, you’ll primarily need to provide the --train_data parameter. Optionally, you can also specify --dev_data, --test_data, and other training hyperparameters.

In our case, we will use

--train_data=train.conll --dev_data=dev.conll --test_data=test.conll

Let’s now go through the key training options:

  1. Foundation model: You'll need a Transformer-based foundation model. For multilingual tasks, use models such as xlm-roberta-base or xlm-roberta-large. These models also perform well on monolingual datasets. Alternatively, consider a monolingual model if one is available for your language. For the Czech CHNEC, we'll try the monolingual ufal/robeczech-base in addition to xlm-roberta-base and xlm-roberta-large.
  2. Decoding type: If your corpus is flat, set --decoding=classification. If it contains nested entities, use --decoding=seq2seq. In this example, our entities are flat, so we will use --decoding=classification.
  3. Batch size: Adjust --batch_size based on available memory. Typical values are 4, 8, 16, or 32. Smaller batch sizes also act as a form of regularization, which is helpful if your dataset is small. Since CHNEC is relatively small, we recommend --batch_size=4.
  4. Context type: If your corpus includes document boundaries, you can optionally mark them using a special token:

    -DOCSTART-	O
    
    John	B-PER
    loves	O
    Mary	B-PER
    .	O
    
    

    Then, set --context_type=split_document. Even if document boundaries are not explicitly marked but the sentences belong to documents, --context_type=split_document often works better. If your data consists of isolated sentences with no document context, use --context_type=sentence. Please note that training on document-based context is currently implemented for flat entities only.

  5. Epochs: A good starting point is --epochs=20. NameTag 3 will automatically select the best checkpoint based on development set performance.
  6. Learning rate: For RoBERTa-based models, use a conservative learning rate: --learning_rate=2e-5. You can experiment with slightly higher learning rates if you're using BERT-based models instead.

We provide three training scripts for this setup, one for each of the foundation models above.

These scripts define all required parameters for running nametag3.py on CHNEC. You can use them as templates for training on your own dataset.

To train from the command line, run the training command directly from the root directory of the NameTag 3 repository:

venv/bin/python3 nametag3.py \
  --batch_size=4 \
  --context_type="split_document" \
  --corpus="czech-chnec" \
  --decoding="classification" \
  --dev_data="dev.conll" \
  --epochs=20 \
  --evaluate_test_data \
  --hf_plm="xlm-roberta-large" \
  --learning_rate=2e-5 \
  --test_data="test.conll" \
  --train_data="train.conll"

Alternatively, you can submit the entire script to a cluster managed by Slurm:

sbatch -C "gpuram40G|gpuram48G" --mem=8G ./02_chnec_xlmr-large.sh

Also see the exact training scripts for the published NameTag 3 models for further inspiration on setting the hyperparameters.

Now, let's inspect the performance of our trained models:

Model                    Training Data   F1 (p, i, g, o, a, t)
Hubková et al. (2020)    CHNEC           73.00
xlm-roberta-base         CHNEC           81.92
ufal/robeczech-base      CHNEC           83.07
xlm-roberta-large        CHNEC           85.15

As we can see, the Large-sized model XLM-RoBERTa Large outperforms both Base-sized models, but the Czech monolingual Base-sized model RobeCzech outperforms the multilingual XLM-RoBERTa Base.

6. Training on a Collection of Corpora with a Shared Tagset

As a rule of thumb in machine learning, the more data, the better. If another manually annotated corpus with the same annotation scheme is available, NameTag 3 can train on the whole collection of datasets.

For reference, the NameTag 3 Multilingual CoNLL Model is trained in exactly this scenario: six corpora in six languages (English, German, Dutch, Spanish, Czech, and Ukrainian), all with the same annotation scheme. See the corresponding training script.

Now, let's get back to this tutorial:

For Czech, besides the Czech Historical Named Entity Corpus (CHNEC), there is the canonical Czech Named Entity Corpus. Let's explore and compare both annotation schemes:

CHNEC is a flat corpus with NE labels p (person), i (institution), g (geographical), t (time), o (artifacts, objects), and a (ambiguous), while CNEC is a nested NE corpus with 46 NE labels and 4 NE containers.

Our situation would be much easier if these two corpora were already annotated with the same tagset, but all is not lost. (You can skip the next label harmonization step if the NE labels already match across your collection of datasets.)

6.1. Harmonizing the Tagset Across Datasets

The goal is to bring our collection of corpora to the same annotation scheme. Here, we will specifically aim at mapping both the CNEC and the CHNEC to the CoNLL-2003 NER annotations (Sang and De Meulder, 2003) using the standard four classes PER, ORG, LOC and MISC.
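
Concretely, the CHNEC labels will be mapped to the CoNLL classes as follows (this is exactly the mapping applied by the sed commands later in this section; shown here as a hypothetical Python dictionary for clarity):

# CHNEC label -> CoNLL-2003 class; the time label t has no CoNLL counterpart and is dropped (mapped to O).
CHNEC_TO_CONLL = {
    "p": "PER",   # person
    "i": "ORG",   # institution
    "g": "LOC",   # geographical
    "o": "MISC",  # artifacts, objects
    "a": "MISC",  # ambiguous
    "t": None,    # time -> O
}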

We will start by downloading the Czech Named Entity Corpus. The latest version is 2.0 and is distributed by LINDAT.

This corpus is distributed in the legacy Treex format. The entire pipeline, which converts the CNEC Treex files into the NameTag 3 CoNLL input format with some cleaning, flattens the corpus from nested to flat NEs, and finally maps the annotation scheme to the four CoNLL labels, looks like this:

cat named_ent_train.treex | preprocessing/cnec2.0_treex2conll2003_nested.pl | tr " " "\t" | sed "s/^\s*$//" | cut -f 1,5 | preprocessing/map_cnec2.0_labels_to_conll.py | preprocessing/iob_to_iob2.py > train.conll
cat named_ent_dtest.treex | preprocessing/cnec2.0_treex2conll2003_nested.pl | tr " " "\t" | sed "s/^\s*$//" | cut -f 1,5 | preprocessing/map_cnec2.0_labels_to_conll.py | preprocessing/iob_to_iob2.py > dev.conll
cat named_ent_etest.treex | preprocessing/cnec2.0_treex2conll2003_nested.pl | tr " " "\t" | sed "s/^\s*$//" | cut -f 1,5 | preprocessing/map_cnec2.0_labels_to_conll.py | preprocessing/iob_to_iob2.py > test.conll

In this pipeline, we take the original CNEC 2.0 train (train), dtest (dev) and etest (test) files. Each file gets converted from the legacy Treex format with preprocessing/cnec2.0_treex2conll2003_nested.pl, and the 46 CNEC 2.0 labels are mapped to just 4 labels PER, ORG, LOC, MISC with preprocessing/map_cnec2.0_labels_to_conll.py.

The CHNEC conversion is fortunately much simpler, as this corpus is already almost in the NameTag 3 input format. Just like before, we convert the CHNEC files to the NameTag 3 input format, this time with additional sed commands that replace the NE labels according to our desired mapping:

for f in train dev test; do
  cut -f1,4 -d" " CHNEC_v0.1_$f.conll | sed "s/-p$/-PER/g; s/-i$/-ORG/g; s/-g$/-LOC/g; s/B-t$/O/g; s/I-t$/O/g; s/-o$/-MISC/g; s/-a$/-MISC/g" | tr " " "\t" | preprocessing/iob_to_iob2.py > $f.conll
done

It is important to ensure not only formal harmonization of the tagsets but also a comparable annotation style across datasets. For instance, we removed the time (t) label from the annotation scheme, as it is not present in the CoNLL-2003 NER dataset (Sang and De Meulder, 2003). When training on a collection of corpora that nominally share the same tagset (PER, ORG, LOC, MISC), but exhibit divergent annotation practices or contextual label usage, the model must allocate additional capacity to infer both when and how to assign entity labels.
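
One practical way to spot such divergences is to compare the entity-label distributions of the harmonized corpora. The sketch below is a hypothetical helper (not part of the repository) that counts entity mentions per label in the converted training files; adjust the paths to wherever you saved the converted data:

from collections import Counter

def label_counts(path):
    """Count B- labels (entity mentions) per class in a NameTag 3 vertical file with flat annotations."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 2 and columns[1].startswith("B-"):
                counts[columns[1][2:]] += 1
    return counts

for corpus in ["czech-cnec2.0-conll/train.conll", "czech-chnec-conll/train.conll"]:
    counts = label_counts(corpus)
    total = sum(counts.values()) or 1
    print(corpus, {label: round(count / total, 3) for label, count in counts.items()})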

6.2. Training on a Collection of Corpora with a Shared Tagset

Now we train on CHNEC-conll both individually (training script) and jointly with CNEC-conll (training script).

Model                    Training Data   F1 (p, i, g, o, a, t)    F1 (PER, ORG, LOC, MISC)
Hubková et al. (2020)    CHNEC           73.00
xlm-roberta-base         CHNEC           81.92
ufal/robeczech-base      CHNEC           83.07
xlm-roberta-large        CHNEC           85.15                    86.73
xlm-roberta-large        CHNEC + CNEC                             89.13

Please note that the numbers on the conll-transformed labels (PER, ORG, LOC, MISC) are not directly comparable with the performance on the original CHNEC annotation scheme (p, i, g, o, a, t).

Let's dissect the CHNEC-conll + CNEC-conll training script:

We supplied the train, dev, and test data of both corpora to NameTag 3, files separated by commas:

--train_data=czech-cnec2.0-conll/train.conll,czech-chnec-conll/train.conll \
--dev_data=czech-cnec2.0-conll/dev.conll,czech-chnec-conll/dev.conll \
--test_data=czech-cnec2.0-conll/test.conll,czech-chnec-conll/test.conll

We also gave NameTag 3 the names of the corpora, again separated by commas:

--corpus=czech-cnec2.0-conll,czech-chnec-conll

Make sure the order of the train, dev, and test files is consistent with the order of the corpus names.

Finally, we told NameTag 3 how to mix the data from the two corpora into training batches:

--sampling=temperature

This means that the batches are constructed using square root temperature sampling with repetition, based on corpus length. This approach effectively downsamples the largest corpora while upsampling the smallest ones.
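
To make the effect concrete, here is a small sketch of how square root temperature sampling reweights corpora of different sizes (the corpus sizes are made-up numbers for illustration, not the actual CNEC and CHNEC sizes, and this is not the actual NameTag 3 implementation):

# Hypothetical corpus sizes in sentences.
sizes = {"czech-cnec2.0-conll": 20_000, "czech-chnec-conll": 4_000}

total = sum(sizes.values())
proportional = {name: size / total for name, size in sizes.items()}

# Square root temperature sampling: sampling probability proportional to size ** 0.5.
weights = {name: size ** 0.5 for name, size in sizes.items()}
weight_sum = sum(weights.values())
temperature_sampled = {name: weight / weight_sum for name, weight in weights.items()}

for name in sizes:
    print(f"{name}: proportional {proportional[name]:.2f} -> temperature sampled {temperature_sampled[name]:.2f}")
# The smaller corpus rises from about 0.17 of the batches to about 0.31,
# while the larger one drops from about 0.83 to about 0.69.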

7. Multitagset Training

Since NameTag 3.1, NameTag 3 can be trained with multiple named entity tagsets. At inference time, the trained model can then be asked to recognize named entities using a specific tagset; if none is requested, a default tagset is used.

For reference, the NameTag 3 Multilingual Model (https://ufal.mff.cuni.cz/nametag/3/models#multilingual) was trained on 21 datasets in 17 languages using the following 3 tagsets:

  • conll (default): The CoNLL-2003 shared task tagset (Sang and De Meulder, 2003): PER, ORG, LOC, and MISC. Will be used for prediction when calling nametag3.py with --tagsets=conll or by requesting nametag3-multilingual-conll-250203 from the NameTag 3 webservice.
  • uner: The Universal NER v1 tagset: PER, ORG, LOC. Used when calling nametag3.py with --tagsets=uner or by requesting nametag3-multilingual-uner-250203 from the NameTag 3 webservice.
  • onto: The OntoNotes v5 tagset: PERSON, NORP, FAC, ORG, GPE, etc. Used when calling nametag3.py with --tagsets=onto or by requesting nametag3-multilingual-onto-250203 from the NameTag 3 webservice.

Multitagset training is currently supported for flat NER and with these three tagsets: conll, uner, onto.

Let's acquire some available data with a different tagset than the conll one we have been using so far:

Download the data of the Universal NER v1 project.

Each of the corpora in the UNER project can be converted to the NameTag 3 input format by using preprocessing/preprocess_uner.py. The script requires the source corpus path (--source_path), the source corpus name (--source_corpus), the converted target corpus path (--target_path), the converted target corpus name (--target_corpus), and the language (--language).

Optionally, you can separate the documents with --add_docstarts if you plan on training with --context_type=split_document.

For example, to preprocess the UNER_Slovak-SNK corpus, one would call:

venv/bin/python3 preprocessing/preprocess_uner.py --add_docstarts --source_path=UNER/UNER_Slovak-SNK --source_corpus=UNER_Slovak-SNK --target_corpus=slovak-UNER_Slovak-SNK-uner --target_path=data_preprocessed/slovak-UNER_Slovak-SNK-uner --language=slovak

In our multitagset training script, we now have to add a third corpus to the mix:

--corpus=czech-cnec2.0-conll,czech-chnec-conll,slovak-UNER_Slovak-SNK-uner \
--dev_data=czech-cnec2.0-conll/dev.conll,czech-chnec-conll/dev.conll,slovak-UNER_Slovak-SNK-uner/dev.conll \
--test_data=czech-cnec2.0-conll/test.conll,czech-chnec-conll/test.conll,slovak-UNER_Slovak-SNK-uner/test.conll \
--train_data=czech-cnec2.0-conll/train.conll,czech-chnec-conll/train.conll,slovak-UNER_Slovak-SNK-uner/train.conll

Most importantly, we tell NameTag 3 to train with multiple tagsets and explicitly list which tagset each corpus uses, in the same order as the corpora:

--tagsets=conll,conll,uner

Results:

Model                    Training Data        F1 (p, i, g, o, a, t)    F1 (PER, ORG, LOC, MISC)
Hubková et al. (2020)    CHNEC                73.00
xlm-roberta-base         CHNEC                81.92
ufal/robeczech-base      CHNEC                83.07
xlm-roberta-large        CHNEC                85.15                    86.73
xlm-roberta-large        CHNEC + CNEC                                  89.13
xlm-roberta-large        CHNEC + CNEC + SNK                            89.59

8. Multitagset Training with a New Tagset

In the previous sections, we trained on the Czech Historical Named Entity Corpus (CHNEC) using the unified conll tagset. But what if you want to use the original CHNEC tagset with labels p, i, g, o, a, and t in a multitagset training setup?

By default, NameTag 3 supports only three tagsets for multitagset training: conll, uner, and onto. If you’d like to include a custom tagset, such as the original CHNEC tagset, you won’t be able to do this directly via the command line.

However, it’s easy to extend NameTag's list of supported tagsets manually. You just need to edit the internal list of allowed tagsets in the source code to include your custom one.

Open the file nametag3_dataset.py and locate the dictionary TAGSETS, which holds the definition of the supported tagsets:

TAGSETS = {
    "conll": ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC", "O"],
    "uner": ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"],
    "onto": ["O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP", "B-FAC", "I-FAC",
             "B-ORG", "I-ORG", "B-GPE", "I-GPE", "B-LOC", "I-LOC", "B-PRODUCT",
             "I-PRODUCT", "B-DATE", "I-DATE", "B-TIME", "I-TIME", "B-PERCENT",
             "I-PERCENT", "B-MONEY", "I-MONEY", "B-QUANTITY", "I-QUANTITY",
             "B-ORDINAL", "I-ORDINAL", "B-CARDINAL", "I-CARDINAL", "B-EVENT",
             "I-EVENT", "B-WORK_OF_ART", "I-WORK_OF_ART", "B-LAW", "I-LAW",
             "B-LANGUAGE", "I-LANGUAGE"]
}

Now, extend that definition with a new tagset chnec:

TAGSETS = {
    "conll": ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC", "O"],
    "uner": ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "O"],
    "onto": ["O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP", "B-FAC", "I-FAC",
             "B-ORG", "I-ORG", "B-GPE", "I-GPE", "B-LOC", "I-LOC", "B-PRODUCT",
             "I-PRODUCT", "B-DATE", "I-DATE", "B-TIME", "I-TIME", "B-PERCENT",
             "I-PERCENT", "B-MONEY", "I-MONEY", "B-QUANTITY", "I-QUANTITY",
             "B-ORDINAL", "I-ORDINAL", "B-CARDINAL", "I-CARDINAL", "B-EVENT",
             "I-EVENT", "B-WORK_OF_ART", "I-WORK_OF_ART", "B-LAW", "I-LAW",
             "B-LANGUAGE", "I-LANGUAGE"],
    "chnec": ["B-p", "I-p", "B-i", "I-i", "B-g", "I-g", "B-o", "I-o", "B-a", "I-a", "B-t", "I-t", "O"]
}

Be sure to list all the labels both with B- (Beginning) and I- (Inside), and include the O (Outside) label.

Also note the comma now required between the dictionary items: the closing "I-LANGUAGE"] of the onto entry becomes "I-LANGUAGE"],.
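
If your tagset contains many labels, you can generate the list programmatically instead of typing it by hand. A small hypothetical helper (not part of NameTag 3):

def make_bio_tagset(entity_types):
    """Expand base entity types into B-/I- labels and append the O label."""
    tagset = []
    for entity_type in entity_types:
        tagset.extend(["B-" + entity_type, "I-" + entity_type])
    return tagset + ["O"]

print(make_bio_tagset(["p", "i", "g", "o", "a", "t"]))
# ['B-p', 'I-p', 'B-i', 'I-i', 'B-g', 'I-g', 'B-o', 'I-o', 'B-a', 'I-a', 'B-t', 'I-t', 'O']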

Now, we can train the CHNEC corpus with the chnec tagset jointly with the CNEC corpus with the conll tagset:

--corpus="czech-cnec2.0-conll,czech-chnec" \
--dev_data="czech-cnec2.0-conll/dev.conll,czech-chnec/dev.conll" \
--tagsets="conll,chnec" \
--test_data="czech-cnec2.0-conll/test.conll,czech-chnec/test.conll" \
--train_data="czech-cnec2.0-conll/train.conll,czech-chnec/train.conll"

For the entire script training on the original CHNEC jointly with CNEC, see tutorial/07_chnec_w_cnec-conll.sh. For a script training on the original CHNEC jointly with CNEC and Slovak SNK, see tutorial/08_chnec_w_cnec-conll_w_snk-uner.sh.

Finally, and this falls outside the official scope of the tutorial due to licensing restrictions on some data, it is also possible to train CHNEC in a multitagset setup alongside the 21 training corpora used to train the NameTag 3 Multilingual Model. See the CHNEC training script and the CHNEC-conll training script.

Results:

Model                    Training Data                  F1 (p, i, g, o, a, t)    F1 (PER, ORG, LOC, MISC)
Hubková et al. (2020)    CHNEC                          73.00
xlm-roberta-base         CHNEC                          81.92
ufal/robeczech-base      CHNEC                          83.07
xlm-roberta-large        CHNEC                          85.15                    86.73
xlm-roberta-large        CHNEC + CNEC                   88.70                    89.13
xlm-roberta-large        CHNEC + CNEC + SNK             87.91                    89.59
xlm-roberta-large        CHNEC + multilingual corpora   90.67                    89.80

9. Continued Fine-Tuning

Now we are going to start with a language model that has already been fine-tuned for NameTag 3, and continue fine-tuning it on our own custom dataset. This approach is often called continued fine-tuning or sequential transfer learning and is especially useful when the original datasets used during the first fine-tuning phase are no longer accessible, for example due to licensing restrictions. By further adapting the model to our specific data, we can build on prior work while tailoring the model to our needs.

For continued fine-tuning to integrate seamlessly with the original NameTag 3 model, the new corpus must use the same tagset as one of the tagsets the model was originally trained on. This ensures compatibility with the model’s output layer.
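
Before launching continued fine-tuning, it is worth verifying that every label in your converted corpus is covered by the tagset you intend to reuse. A minimal, hypothetical check (the conll label list is copied from the TAGSETS dictionary shown in Section 8; adjust the file path to your converted data):

# Copy of the conll tagset from TAGSETS in nametag3_dataset.py (see Section 8).
CONLL_TAGSET = {"B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC", "O"}

def unknown_labels(path, allowed):
    """Return the labels in a vertical file that are not covered by the allowed tagset."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 2 and columns[1]:
                seen.add(columns[1])
    return seen - allowed

print(unknown_labels("czech-chnec-conll/train.conll", CONLL_TAGSET))  # should print set()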

For continued fine-tuning on CHNEC, we therefore select the NameTag 3 Multilingual Model as our starting point, and we use the conll version of the CHNEC. We modify the script for training CHNEC-conll from scratch by adding two new command-line arguments:

  1. We load the nametag3-multilingual-250203 model: --load_checkpoint=models/nametag3-multilingual-250203.
  2. As this model has been trained using multitagset training, we tell NameTag 3 which tagset CHNEC-conll is using: --tagsets=conll.

Continued fine-tuning is a balancing act and setting hyperparameters for continued fine-tuning requires careful consideration. If the training runs for too few epochs, the model may underfit and fail to learn from the new dataset. On the other hand, training for too long can cause the model to overfit to the new data and lose the generalization ability it acquired during the initial fine-tuning. This is especially important when the new dataset is small or narrowly focused. (As always, it's important to tune these hyperparameters using the development (dev) set, not the test set. The dev set helps you make informed decisions about when to stop training and which settings work best, while keeping the test set untouched ensures an honest final evaluation of the model's performance. Using the test set for tuning would leak information and lead to overly optimistic results that don’t reflect real-world performance.)

Here, notice that we decreased the number of epochs to --epochs=10 for continued fine-tuning on CHNEC.

The continued fine-tuning script for CHNEC then contains the following NameTag 3 command:

venv/bin/python3 nametag3.py \
  --batch_size=4 \
  --context_type="split_document" \
  --corpus="czech-chnec-conll" \
  --decoding="classification" \
  --dev_data="dev.conll" \
  --epochs=10 \
  --evaluate_test_data \
  --hf_plm="xlm-roberta-large" \
  --learning_rate=2e-5 \
  --load_checkpoint="models/nametag3-multilingual-250203/" \
  --logdir="logs_tutorial" \
  --name="chnec-conll" \
  --tagsets="conll" \
  --test_data="test.conll" \
  --train_data="train.conll"

In the following results table, compare the rows with just CHNEC as the training data. When fine-tuning with CHNEC-conll on xlm-roberta-large from scratch (4th row), we get F1=86.73, while by sequential fine-tuning on the already fine-tuned NameTag 3 multilingual model, we improve to F1=88.41 (the last row).

Model                                                   Training Data                  F1 (p, i, g, o, a, t)    F1 (PER, ORG, LOC, MISC)
Hubková et al. (2020)                                   CHNEC                          73.00
xlm-roberta-base                                        CHNEC                          81.92
ufal/robeczech-base                                     CHNEC                          83.07
xlm-roberta-large                                       CHNEC                          85.15                    86.73
xlm-roberta-large                                       CHNEC + CNEC                   88.70                    89.13
xlm-roberta-large                                       CHNEC + CNEC + SNK             87.91                    89.59
xlm-roberta-large                                       CHNEC + multilingual corpora   90.67                    89.80
NameTag 3 multilingual (fine-tuned xlm-roberta-large)   CHNEC                                                   88.41