The project of annotation of Czech texts in the PDT covers three levels: morphological, analytical and tectogrammatical.
Annotation at all levels is based on Czech texts in an SGML format (CSTS DTD), which is the basic format of the Czech National Corpus (ČNK). Most of the texts have been taken directly from ČNK. In this format the Czech texts are already divided into separate words (word-forms), sentences and paragraphs. Punctuation is explicitly marked and graphic information from the original text has been preserved wherever possible. Numbers in numerical form are also marked and decimals are normalized.
Texts chosen at random (in continuous samples) from the texts of ČNK are used for annotations at all levels.
The annotation (tagging) at the morphological level is linear. To each original word-form (name of the attribute: origf, SGML, attribute <w>) in the text, three attributes are assigned, namely word-form, lemma and tag. Tagging is manual, with the aid of a full-screen program, sgd, which runs under Linux (it can, however, also be operated remotely, e.g. from DOS). We also use an MS Windows program called DA, which is compatible with sgd at the data level and is very similar at the GUI level as well. Both programs require preliminary morphological processing of the original text, i.e., each word-form is expected to be accompanied by a list of all its possible lemmas and their possible morphological categories. This assignment is done automatically on the basis of an electronic dictionary (at present the vocabulary covers some 98-99% of current newspaper or magazine texts, including names). The remaining word-forms are tagged manually. Typing errors are kept in the attribute origf; however, they are corrected (manually) and treated in the attribute form.
Morphological tagging with the aid of the program sgd (or DA) can be performed prior to annotation at the analytical level, but also after this has been accomplished, or in parallel. Both input and output data for the morphological annotation programs are in the SGML format according to the CSTS DTD. The volume of texts tagged at the morphological level is about 1.8 million tokens.
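In most cases the word-form is identical with the original word-form as found in the original text, including the use of lowercase and/or uppercase letters. Exceptions occur only when the original word-form represents a number with a decimal point, a form of the word aby/kdyby, a preposition contracted with a pronoun, a word with an attached -s, or a typing error.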
In these cases the form (form) is derived from the original word-form (origf) in the following way:
origf | Number of form attributes | 1st or only form | 2nd form |
---|---|---|---|
number with a decimal point | 1 | number with a decimal point | |
form of the word aby/kdyby | 2 | aby/kdyby | conditional by in the corresponding form (e.g., bychom) |
preposition with a pronoun | 2 | preposition | pronoun in the corresponding (long) form (e.g., naň -> na + něj) |
word with an -s | 2 | word without -s | jsi |
typing error | 1 | corrected form | |
typing error with contracted forms | 2 | as in rows 2-4, corrected | as in rows 2-4, corrected |
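The mapping from origf to one or two form values can be sketched as a simple lookup. This is only an illustration, not the procedure used by sgd or DA: the table `CONTRACTIONS` and the function `derive_forms` are hypothetical names, and the only entries are the two examples that follow from the table above (naň, and abychom as an instance of aby + bychom).

```python
# Hypothetical lookup table: contracted original word-form -> list of form values.
# Only examples that follow from the table in the text are listed; a real tool
# would rely on the morphological dictionary rather than a hand-written table.
CONTRACTIONS = {
    "naň": ["na", "něj"],          # preposition + pronoun in its long form
    "abychom": ["aby", "bychom"],  # aby + conditional "by" in the matching form
}


def derive_forms(origf):
    """Return the list of form values (one or two) for an original word-form.

    Word-forms that are not listed as contractions are returned unchanged,
    which corresponds to the common case where form equals origf.
    """
    return CONTRACTIONS.get(origf, [origf])
```

Typing errors are not handled here, since their correction is manual according to the text.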
The lemma unequivocally identifies a word as a lexical unit. It is represented by a string of letters and signs which in most of the cases corresponds to the so-called dictionary form of the word, or, to put it differently, to the word-form under which the word usually figures in dictionaries.
Part of speech | Morphological categories of the word-form in the attribute lemma |
---|---|
Noun | Nominative, singular, no negation (unless there is a positive form and negation does not change the lexical meaning); pluralia tantum: the same, but in plural |
Adjective | Masculine animate, nominative, singular, no negation, 1st degree of comparison (positive) |
Pronoun | If there are such categories: nominative, singular, masculine animate, no negation; in particular, personal pronouns are represented only by já, ty, on (I, you, he) |
Numeral | If there are such categories: Nominative, singular, masculine animate, no negation |
Verb | Infinitive |
Adverb | 1st degree, no negation |
Preposition | no vocalisation |
The rest | the original form |
Orthographic variants are to be unified (if, of course, they represent just genuine orthographic variants and not, e.g., a shift in meaning; this concerns the category "rest" as well).
The identification string obtained in this way can be completed by one or more additional distinguishing identifiers, each consisting of a hyphen and one or more decimal digits (e.g., -2). An isolated zero is not used. These identifiers serve to distinguish grammatical forms belonging to different lexical units (e.g., the noun hnát-2 versus the verb hnát-1, where -2 stands for shank and -1 for the verb drive, pursue; cf. English bear N and bear V). In exceptional cases the same means can be used to distinguish the meanings of a full homonym: e.g., strana-4 (=page) vs. strana-2 (=polit. party).
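As a minimal sketch of how such a lemma could be split into its base string and its numeric identifier, assuming a single trailing identifier of the form hyphen plus digits. The function name `split_lemma` is hypothetical and is not part of any PDT tool.

```python
import re

# Base string, optionally followed by a hyphen and a run of digits ("hnát-2").
_LEMMA_RE = re.compile(r"^(?P<base>.+?)(?:-(?P<num>\d+))?$")


def split_lemma(lemma):
    """Split a lemma into (base, number); number is None if there is no suffix."""
    match = _LEMMA_RE.match(lemma)
    if match is None:
        raise ValueError(f"not a valid lemma: {lemma!r}")
    num = match.group("num")
    number = int(num) if num is not None else None
    return match.group("base"), number
```

With the examples from the text, `split_lemma("hnát-2")` yields the base `hnát` with number 2, and `split_lemma("trnka")` yields `trnka` with no number. The sketch does not enforce the rule that an isolated zero is not used, and it would mis-split a lemma whose base itself ends in a hyphen followed by digits.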
Uppercase and lowercase letters play their part in distinguishing lexical units; they are used to distinguish common names from otherwise identical proper names (e.g., trnka vs. Trnka: blackthorn and Trnka, professor). The original case of the letters as found in the text (attributes form or origf) is disregarded, i.e., if a (common) word was originally written with an uppercase letter in initial position (titles, beginning of a sentence), it is contained in the attribute lemma in lowercase letters only.
The morphological tag consists of a sequence of uppercase and lowercase letters of the English alphabet (and some other allowed symbols) and of digits. There are 15 positions in the tag (13 of them actually used) for the morphological category values.
Pos. | Category | Description | Czech Term |
---|---|---|---|
1 | POS | Part of Speech | Slovní druh |
2 | SUBPOS | Detailed Part of Speech | Slovní poddruh |
3 | GENDER | Agreement Gender | Rod |
4 | NUMBER | Agreement Number | Číslo |
5 | CASE | Case | Pád |
6 | POSSGENDER | Possessor's Gender | Rod vlastníka |
7 | POSSNUMBER | Possessor's Number | Číslo vlastníka |
8 | PERSON | Person | Osoba |
9 | TENSE | Tense | Čas |
10 | GRADE | Degree of Comparison | Stupeň |
11 | NEGATION | Negation (by prefix) | Negace |
12 | VOICE | Voice | Slovesný rod |
13 | RESERVE1 | Reserved for future use | Rezerva |
14 | RESERVE2 | Reserved for future use | Rezerva |
15 | VAR | Variant, Style, Register | Varianta, styl |
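Because the tag is positional, it can be decoded by pairing each character with the category name at the same position. The following is a minimal sketch; `CATEGORIES` and `decode_tag` are hypothetical names, and the example uses the punctuation tag Z:------------- that appears in this section. Reading a hyphen as "no value at this position" is an assumption of the sketch.

```python
# Category names in positional order, taken from the table above.
CATEGORIES = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map each of the 15 tag positions to its category name."""
    if len(tag) != len(CATEGORIES):
        raise ValueError(
            f"expected {len(CATEGORIES)} positions, got {len(tag)}: {tag!r}"
        )
    return dict(zip(CATEGORIES, tag))


# Example: the punctuation tag, written as "Z:" followed by 13 hyphens.
punctuation = decode_tag("Z:" + "-" * 13)
```

Here `punctuation["POS"]` is `Z` and `punctuation["SUBPOS"]` is `:`, while all remaining positions hold a hyphen.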
Distinguishing part-of-speech category according to the first letter of the tag:
1st letter of the tag | part-of-speech |
---|---|
N | noun |
A | adjective |
P | pronoun |
C | numeral |
V | verb |
D | adverb |
R | preposition |
J | conjunction |
I | interjection |
T | particle |
Z | punctuation, numeral figures, root of the tree |
X | (unknown, unidentified) |
For sentence boundaries the tag Z#------------- is assigned, while for punctuation it is Z:-------------; however, the sentence boundary is not used explicitly at the morphological level.
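The first-letter table and the two special tags can be combined in a small lookup, sketched below. The names `POS_BY_LETTER`, `part_of_speech` and `is_sentence_boundary` are hypothetical; the letter-to-category mapping is copied from the table above, and returning `None` for a letter outside the table is a choice made for this sketch only.

```python
# First letter of the tag -> part of speech, as listed in the table above.
POS_BY_LETTER = {
    "N": "noun",
    "A": "adjective",
    "P": "pronoun",
    "C": "numeral",
    "V": "verb",
    "D": "adverb",
    "R": "preposition",
    "J": "conjunction",
    "I": "interjection",
    "T": "particle",
    "Z": "punctuation, numeral figures, root of the tree",
    "X": "(unknown, unidentified)",
}

# "Z#" followed by 13 hyphens, i.e. 15 positions in total.
SENTENCE_BOUNDARY_TAG = "Z#" + "-" * 13


def part_of_speech(tag):
    """Return the part of speech for a tag, or None for an unlisted first letter."""
    if not tag:
        raise ValueError("empty tag")
    return POS_BY_LETTER.get(tag[0])


def is_sentence_boundary(tag):
    """True only for the dedicated sentence-boundary tag."""
    return tag == SENTENCE_BOUNDARY_TAG
```

Note that both the sentence-boundary tag and the punctuation tag start with Z, so the first letter alone does not tell them apart; the second position is needed for that.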