w
The form of the original token as found in the original
source text. Its text (#PCDATA) is in most cases
identical to the initial text (#PCDATA) of the <f> element, in which case it can be
completely omitted. Otherwise it must immediately precede
the corresponding "normalized" <f> element(s).
It is used in the following cases:
- for automatically processed data:
- normalized numbers; spaces and/or other thousand separators are
removed in <f>, decimal
separators other than periods are replaced by periods. The kind attribute has the value
num.orig in both cases. There is always exactly one
<w> element for an <f> element.
- contracted forms; examples include tys (ty +
jsi), nač (na + co),
abys/abychom/abyste, and all words with
attached "-s" (jsi) if identified by the
automatic processing at tokenization or tagging time. (In English,
this would normally include isn't, wanna, etc.)
The kind attribute has the value
ctcd, and all the case attributes of the
corresponding <f> elements
have the (sub)string gen added.
- parts of phrases treated as a single token in the subsequent
processing; for example, fixed multiword names are treated this
way, as is the fixed phrase (být) s to. This case also includes
peculiar formatting such as titles widened by spaces (such as
P r a g u e) etc. The kind attribute has the value
phrpart at every instance of <w>, and all the case attributes of the
corresponding <f> elements
have the (sub)string phrase added.
- for manually annotated data:
- spelling errors; the string with an error is preserved at this
element, with the kind
attribute set to spell.
- missing forms; rarely used, since only obvious omissions
(normally classified as typos) are corrected. This is the
only case in which the element's text (#PCDATA) is empty;
the kind attribute is set to
ins.
- superfluous "forms"; used e.g. for graphical symbols made up
of letters and punctuation and misidentified by the tokenizer as
words; the kind attribute is set
to del.
- any of the cases listed above in the "automatic" list, if the
automatic tokenization procedure got it wrong.
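The number normalization described above for kind num.orig can be sketched in Python. This is only an illustration, not the procedure used to produce the PDT data. The function names normalize_number and original_if_needed are hypothetical, and the rule that a comma marks the decimal separator (so that any periods in the same token are thousand separators) is an assumption based on Czech conventions.

```python
import re
from typing import Optional


def normalize_number(orig: str) -> str:
    """Normalize a number token roughly as described for kind="num.orig"."""
    # Remove spaces (including non-breaking ones) used as thousand separators.
    s = re.sub(r"\s", "", orig)
    if "," in s:
        # Assumption: the comma is the decimal separator, so any periods
        # in the token are thousand separators and are dropped.
        s = s.replace(".", "").replace(",", ".")
    return s


def original_if_needed(orig: str) -> Optional[str]:
    """Return the text a <w> element would carry, or None if it can be omitted."""
    # <w> is only needed when the original differs from the normalized form.
    return orig if normalize_number(orig) != orig else None
```

Under these assumptions, a token such as "12 345,67" would need a <w> element holding the original string, while "42" would not. Tokens that use commas as thousand separators are not handled by this sketch.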
In the PDT data, the default value same of the kind attribute is never used
explicitly; in fact, the whole <w> element, although theoretically
correct, is never present in such a case.
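The constraints stated in this description can be collected into a small consistency check. The following Python sketch is hypothetical: the function name check_w and the set DESCRIBED_KINDS are illustrative, and the set lists only the kind values named in this description, so the DTD itself may allow others.

```python
from typing import List

# kind values mentioned in this description; "same" is the default.
# Assumption: the DTD may define further values not listed here.
DESCRIBED_KINDS = {"same", "num.orig", "ctcd", "phrpart", "spell", "ins", "del"}


def check_w(kind: str, text: str) -> List[str]:
    """Return a list of problems found for one <w> element (empty if none)."""
    problems = []
    if kind not in DESCRIBED_KINDS:
        problems.append("kind %r is not among the described values" % kind)
    if kind == "same":
        # In the PDT data the default is never written out; <w> is omitted.
        problems.append("kind 'same' is never used explicitly in the PDT data")
    if text == "" and kind != "ins":
        problems.append("empty text is only expected with kind 'ins'")
    if kind == "ins" and text != "":
        problems.append("kind 'ins' marks a missing form, so text should be empty")
    return problems
```

For example, check_w("ins", "") reports no problems, whereas check_w("spell", "") flags the empty text.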
Content
ATTRIBUTES
CONTENT DECLARATION
Tag Minimization
- Open Tag: REQUIRED
- Close Tag: OPTIONAL
Parent Elements