Unlike in version 1.0, it is now preferred to separate named entity tagging from morphology. Named entities (often multi-word) should be marked and categorized as special phrases on a layer other than the morphological one; this is a separate project that has not been included in PDT 2.0. Lemmas of proper names will still bear information on the name category. Nevertheless, we respect the original idea that the term suffixes shall explain the meaning of the lemma, not the context it appears in. Thus, for instance, New should be lemmatized as new_,t in New York, not New_;G. York should be lemmatized as York_;G even in New York Times, where it was previously York_;K. For details see below.
Unfortunately, it was not feasible to enforce the desired lemmatization in PDT 2.0. The annotation is still inconsistent in this respect. We plan to correct it in a future version.
Table 3.1. Name types
Type | Explanation, examples |
---|---|
Y | given name (formerly used as default): Petr, John |
S | surname, family name: Dvořák, Zelený, Agassi, Bush |
E | member of a particular nation, inhabitant of a particular territory: Čech, Kolumbijec, Newyorčan |
G | geographical name: Praha, Tatry (the mountains) |
K | company, organization, institution: Tatra (the company) |
R | product: Tatra (the car) |
m | other proper name: names of mines, stadiums, guerilla bases, etc. |
The lemma should start with an upper-case letter if the word is always capitalized in names (Špaček_;S is always capitalized, špaček is not).
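The following minimal Python sketch shows one way the name type could be read off a lemma. It assumes that a term suffix has the form `_;` followed by a single letter, as in the examples above (York_;G, Špaček_;S), and that the base form is everything before the first underscore. The names `NAME_TYPES`, `name_types` and `base_form` are hypothetical helpers introduced for illustration; they are not part of any PDT tool.

```python
import re

# Name-type codes and explanations taken from Table 3.1.
NAME_TYPES = {
    "Y": "given name",
    "S": "surname, family name",
    "E": "member of a nation, inhabitant of a territory",
    "G": "geographical name",
    "K": "company, organization, institution",
    "R": "product",
    "m": "other proper name",
}

# Assumption: a term suffix is "_;" followed by exactly one letter.
TERM_SUFFIX = re.compile(r"_;([A-Za-z])")


def name_types(lemma):
    """Return the Table 3.1 explanations for every name-type suffix in a lemma."""
    codes = TERM_SUFFIX.findall(lemma)
    # Codes that are not listed in Table 3.1 are silently skipped.
    return [NAME_TYPES[code] for code in codes if code in NAME_TYPES]


def base_form(lemma):
    """Return the part of the lemma before the first underscore (assumed layout)."""
    return lemma.split("_", 1)[0]


if __name__ == "__main__":
    for lemma in ["York_;G", "Špaček_;S", "new_,t", "špaček"]:
        print(lemma, base_form(lemma), name_types(lemma))
```

Under these assumptions, new_,t yields no name type, because its suffix does not begin with `_;`, which matches the rule that New in New York is not itself tagged as a geographical name.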