The Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with 35220 manually annotated Czech named entities. It is a major update to the Czech Named Entity Corpus 1.0, the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. The corpus is available under the CC BY-NC-SA 3.0 license.
The corpus uses 46 atomic named entity types, which can be embedded, e.g., the river name can be part of a name of a city as in <gu Ústí nad <gh Labem>>. There are also 4 so-called NE containers: two or more NEs are parts of a NE container (e.g., two NEs, a first name and a surname, form together a person name NE container such as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
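The embedded notation used in the examples above can be read with a small recursive parser. The following is only a minimal sketch: it assumes the annotation looks exactly like the inline examples (an opening `<`, a type tag, the entity text, a closing `>`), and the function name `parse_annotated` is hypothetical. The actual corpus files may use a different layout, so check the distributed data before relying on it.

```python
def parse_annotated(text):
    """Parse inline NE annotation such as '<gu Ústí nad <gh Labem>>'.

    Returns (plain_text, entities), where entities is a list of
    (type, surface_text) tuples with outer entities before inner ones.
    Assumption: '<' and '>' occur only as annotation brackets.
    """
    entities = []

    def parse(pos, inside):
        chunks = []
        while pos < len(text):
            ch = text[pos]
            if ch == "<":
                # The type tag runs up to the next whitespace or bracket.
                end = pos + 1
                while (end < len(text) and not text[end].isspace()
                       and text[end] not in "<>"):
                    end += 1
                etype = text[pos + 1:end]
                # Skip the single separator after the tag, if present.
                if end < len(text) and text[end].isspace():
                    end += 1
                slot = len(entities)
                entities.append(None)  # reserve the slot: outer before inner
                inner, pos = parse(end, True)
                entities[slot] = (etype, inner)
                chunks.append(inner)
            elif ch == ">" and inside:
                # Closing bracket of the entity currently being read.
                return "".join(chunks), pos + 1
            else:
                chunks.append(ch)
                pos += 1
        if inside:
            raise ValueError("unbalanced annotation: missing '>'")
        return "".join(chunks), pos

    plain, _ = parse(0, False)
    return plain, entities


plain, entities = parse_annotated("<gu Ústí nad <gh Labem>>")
print(plain)
print(entities)
```

For the city example, this yields the plain text `Ústí nad Labem` and two entities, the outer `gu` covering the whole name and the embedded `gh` covering `Labem`. Container tags such as `P` are handled the same way as atomic types; the only difference is that they are written in upper case.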
The named entities are classified according to an updated version of the CNEC 1.0 hierarchy described in Ševčíková et al. (2007).
Named entities are saved in formats:
The Czech Named Entity Corpus 2.0 can be downloaded from the LINDAT/CLARIN repository.
Performance on the Czech Named Entity Corpus 2.0 is evaluated using the canonical script distributed with the corpus. The evaluation metric is strict span-based micro F1: an entity counts as correct only if both its span and its type are correct.
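To make the metric concrete, here is a minimal sketch of strict span-based micro F1. It is not the canonical script and may differ from it in details; for reported results, use the script distributed with the corpus. The function name `strict_micro_f1` and the entity representation, a `(sentence_id, start, end, type)` tuple, are assumptions made for illustration.

```python
def strict_micro_f1(gold, predicted):
    """Return (precision, recall, f1) over all entities pooled together.

    Each entity is a (sentence_id, start, end, type) tuple. A predicted
    entity is a true positive only if an identical tuple is in gold,
    i.e. both the span and the type match exactly.
    Assumption: duplicates are collapsed, since sets are used.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: the second prediction has the right span but the wrong type,
# so under strict matching it is both a false positive and a miss.
gold = [(0, 0, 3, "gu"), (0, 2, 3, "gh"), (1, 0, 2, "P")]
predicted = [(0, 0, 3, "gu"), (0, 2, 3, "gs"), (1, 0, 2, "P")]
precision, recall, f1 = strict_micro_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the counts are pooled over all entity types before precision and recall are computed, frequent types influence the score more than rare ones; that is what distinguishes micro from macro averaging.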
The changes in the named entity hierarchy compared to CNEC 1.0 are the following:
New data was annotated and added: