Table of Contents
The Prague Markup Language (PML) is a common basis of an open family of XML-based data formats for representing rich linguistic annotations of texts, such as morphological tagging, dependency trees, etc. PML is an on-going project in its early stage. This documentation reflects the current status of the PML development.
PML tries to identify common abstract data types and structures used in linguistic annotations of texts as well as in lexicons (especially those intended for machine use in NLP) and other types of linguistic data, and to define a unified, straightforward and coherent XML-based representation for values of these abstract types. PML also emphasizes the following aspects of linguistic annotation: the stand-off annotation methodology, possibility to stack layers of annotation one over another, and extensive cross-referencing. PML also tries to retain simplicity, so that PML instances (actual PML representation of the data) could be processed with conventional XML-oriented tools.
Unlike, e.g. TEI XML, XHTML or DocBook, PML by itself is not a full XML vocabulary but rather a system for defining such vocabularies.
A fully specified XML vocabulary satisfying the requirements constituted in this document is called an application of PML. An Application of PML is formally defined using a specialized XML file called PML schema. PML schema provides one level of abstraction over standard XML-schema languages such as Relax NG or W3C XML Schemas. It defines an XML vocabulary and document structure by means of PML data types and PML roles. An XML document conforming to a PML schema is a PML instance of the schema. PML data types, described in detail in Section 2, “PML data types”, include atomic types (identifiers, strings, integers, enumerated types, id-references, etc.), and complex types, which are composed from abstract types such as attribute-value structures (AVS), lists, alternatives, and mixed-type sequences. We refer to a value of a complex type as a construct. The information provided by PML roles is orthogonal to data typing. It identifies a construct as a bearer of an additional higher-level property of the annotation, such as being a node of a dependency tree, or being a unique identifier (see Section 4, “PML roles”).
Based on a PML schema of a particular application of PML, it is possible to automatically derive a corresponding Relax NG schema that conventional XML-oriented tools can use to validate actual PML instances (see Section 11, “Tools”).
All XML tags used in applications of PML belong to a dedicated XML namespace
http://ufal.mff.cuni.cz/pdt/pml/
We will refer to the above namespace as PML namespace.
PML schema files use the following XML namespace referred to as PML schema namespace:
http://ufal.mff.cuni.cz/pdt/pml/schema/
Currently PML reserves three element names from the PML
namespace for the representation of the technical elements:
LM
(for bracketing list members), AM
(for bracketing alternative members), and head
(for a
common PML instance header described in detail in Section 5, “Header of a PML instance”).
The PML currently recognizes the following abstract data types from which complex data types are built by means of composition:
Atomic values are literal strings. The exact content of an atomic value may be further specified as its format (see Section 3, “Atomic data formats”). In the XML, atomic values are (depending on the context) represented in XML either as a CDATA (i.e. text) content of an element or as an attribute value.
An atomic-value type defined as an exhaustive list of possible values of that type.
A structure is a versatile PML abstract type.
Sometimes it is called a feature-structure, attribute-value structure or
AVS. To avoid confusion with XML attributes, we refer
to attributes of a structure as
members.
A structure is similar to a struct
type in the C
programming language.
A structure is fully specified by names, types
and optionally roles for each of its members.
Different members of the structure must have distinct names.
The structure is represented in XML by an element
whose only content are attributes and/or sub-elements representing the members of the
structure. An attribute or sub-element representing a
member is named by the member and its content is the XML
representation of the member's value.
The order of members in the structure as represented in
XML may be arbitrary. Whether a particular
member is represented by an attribute or a sub-element is
specified in the PML schema, however, only members with
values of atomic types can be represented by
attributes. Some structure members
may in the PML schema be formally declared as required,
in which case they must appear in the structure and its
XML representation and must have non-empty content.
All members not explicitly declared as required are
optional.
PML offers unified representation of both ordered and
unordered lists of constructs of the same type (the
list member type). PML lists
represent data similar to arrays in various
programming languages. An XML element representing a
construct of a list type must as its only child-nodes have
either zero or more XML elements named LM
(“List Member”),
each representing a construct of the list member type, or
else (as a compact representation of singleton lists) its
content must be of the list member type. List member type
can not be a list, i.e. lists of lists are not
allowed. Technically, the difference between ordered and
unordered lists is only in the declaration. Ordered lists
may still contain repeated member (members with the same
value). Applications are only required to preserve
the ordering of ordered lists.
Similar to unordered lists but different in usage and semantics
are alternatives. Alternatives can be used to represent
data where usually one value of a certain type is used,
but under some circumstances several alternative (or parallel) values
are allowed. An XML element representing an alternative
of constructs of a certain type (alternative member type)
is either a representation of a construct of that type
(in case of a single value, i.e. no actual alternative values) or
has as its only child-nodes two or more XML elements named
AM
(“Alternative Member”),
each of which represents a construct of the alternative member type.
Alternative member type must not be an alternative, i.e.
alternatives of alternatives are not allowed.
Sequences are similar to ordered lists but do not require their member constructs to be of the same type. Each member of a sequence is represented by an XML element whose name is bound in the sequence definition with the type of the construct it bears and whose content represents the value. The order and number of occurrences of elements in a sequence may be specified by a regular expression or left unrestricted.
Containers are similar but simpler to structures and can be used to annotate a piece of data by a set of attribute-value pairs with text-only values. They are represented in XML by an element whose content is the data and XML attributes are the annotation. The content of a container can be of any type except for a container and structure.
Because of the compact representation of
singleton lists and alternatives, a special care should be
taken when using containers with content whose type is a
list or alternative of containers or structures in order to avoid possible
collisions between names of the attributes of the container and
attributes or members rendered as attribute of the contained
(singleton) container or structure. (This problem also applies to
type derivation and also to inheritance which is to appear in a future revision of this specification).
To avoid such problems, applications serializing PML data to XML are allowed to
surround singleton list or alternative members within a container by
LM
or AM
tags respectively,
and they must do so if a name collision is apparent from the PML schema.
PML currently recognizes the follwoing atomic data formats: In the future, specification for more formats will be added and/or some generic mechanism for introducing user-defined atomic formats will be added.
any
Arbitrary string of characters (used in all cases not covered by the formats below).
ID
An identifier string, i.e. a string
satisfying the
NCName
production of
the W3C specification Namespaces in XML.
Note in particular that the specification explicitly forbids a colon (:
) to
occur within an identifier.
Example: ab
, doc1.para2
, and _d3p9_34-a2
,
are all valid identifiers,
(whereas -ab
, 234a
, and a:x34
are all invalid).
PMLREF
An atomic value which either is of the ID
format
described above, or consists of two substrings of the format ID
delimited by the character #
.
Values of this format usually represent a reference (link),
see Section 9, “References in PML”.
Example: doc1#chap2-para3
or doc1
.
PML further recognizes the following selected XML Schema built-in simple types
as PML cdata
formats
(each format is specified to cover the lexical space of the corresponding simple type in the XML Schema specification without
consraining facets):
string
,
normalizedString
,
token
,
base64Binary
,
hexBinary
,
integer
,
positiveInteger
,
negativeInteger
,
nonNegativeInteger
,
nonPositiveInteger
,
long
,
unsignedLong
,
int
,
unsignedInt
,
short
,
unsignedShort
,
byte
,
unsignedByte
,
decimal
,
float
,
double
,
boolean
,
duration
,
dateTime
,
date
,
time
,
gYear
,
gYearMonth
,
gMonth
,
gMonthDay
,
gDay
,
Name
,
NCName
,
anyURI
,
language
,
IDREF
,
IDREFS
,
NMTOKEN
,
NMTOKENS
,
PML roles indicate a formal role that a given construct plays in the annotation schema. Roles are orthogonal to types, but usually are compatible only with certain types of constructs. Roles are primarily intended to be used by applications processing the data. So far the following roles have been specified:
Only applicable to a list or sequence constructs. This role identifies a construct whose member constructs represent dependency or constituency trees.
Only applicable to a structure or a sequence-member construct. This role identifies a node of a dependency or constituency tree.
Only applicable to a sequence member or a container whose content is
of a list or a sequence and whose role is
#NODE
.
or to a sequence in an element of role #NODE
.
This role identifies a construct representing a list of child-nodes
of a node in a dependency or constituency tree.
Only applicable to an atomic construct,
typically with the format ID
.
A value with this role uniquely identifies
a construct (a structure, sequence, container, etc.) in
the PML instance. This means that all values with the role
#ID
within a PML instance are distinct..
This role indicates that the application may resolve the atomic value(s) as PML references and replace their content with copies of the referenced PML constructs. This role is only applicable to either:
a structure member of atomic type with the PMLREF format
a sequence element of atomic type with the PMLREF format
a list of atomic members with the PMLREF format. The list must occur in a container or as a structure member.
This role identifies a structure member containing a non-negative integer value used for ordering nodes in an ordered tree.
This role identifies a structure member whose non-zero non-empty value indicates that an application may hide the structure from the user.
Every PML instance starts with the header
element
which must occur as the first sub-element of the
document element. The header element has the following
sub-elements:
schema
Associates the instance with a PML schema file, indicating
that the instance conforms to the associated schema. The filename
or URL of the PML schema file is specified in the attribute
href
.
references
This element contains zero or more reffile
sub-elements, each of which maps a filename or URL (attribute
href
) of some external
resource to an identifier (attribute id
)
used as aliases when referring to the resource from the instance
(see Section 9, “References in PML”).
If the external resource is an instance
bound with the current instance as declared
in the PML schema, then reffile
must have also a third attribute, name
,
containing the name used in the tag
reference
in the PML schema declaration
of the bound instance. For every resource bound
to the instance in the PML schema (using
reference
tag) there must be
a corresponding reffile
.
In this section, the syntax of a PML schema file is specified. We describe the content of individual PML schema elements by formal patterns similar to the grammar used in DTD for element-content model specification:
name
lower-case literals denote names of XML elements
PCDATA
denotes arbitrary text content
EMPTY
denotes empty content
(
...)
brackets delimit groups of adjacent content
?
indicates that the element or group whose specification immediately precedes is optional
*
indicates that the element or group whose specification immediately can be repeated
|
separates specifications of exclusively alternative content
,
separates specifications of adjacent content
A formal definition of the PML schema file syntax is available as a Relax NG schema, see Relax NG for PML schema.
All elements of the PML schema file belong to the PML schema namespace. The following elements may occur in a PML schema:
pml_schema
This is the root element of a PML schema file. It may have no attributes (except
for the xmlns
declaration of the
PML-schema namespace). It consists of an optional description
,
declarations of common instance reference
s,
Schema modularization instructions
import
and derive
,
a root
declaration
and zero or more declarations of named types (type
).
PML schemas which contain no import
and derive
instructions are called
simplified PML schemas. The
section Section 7, “Processing modular PML schemas” describes how
import
and derive
instructions are to be processed in order
to obtain an equivalent simplified PML schema.
Attributes
version
Version of the PML specification the schema conforms to, currently
%pml_version;
Content.
(revision?, description?, reference*, import*, derive*, root?, type*)
revision
This optional element is used to assign a revision number to a particular version of the PML schema, see Section 8, “Numbering revisions of PML schemas”
Content.
PCDATA
description
This element provides an optional short description of the PML schema.
Content.
PCDATA
reference
This element declares that each instance of the PML schema is bound with another PML instance (usually of a different PML schema) and provides a hint for an application on how to process the bound instance.
Attributes
name
a symbolic name for the bound instance.
This name is used in the reffile
element in
the referring file's header to identify the bound
instance (see Section 5, “Header of a PML instance”).
readas
the value trees
instructs the application to
read the bound instance as a sequence of dependency or
constituency trees; value dom
instructs
the application to read the bound instance using the
generic Document object model.
import
This element instructs an application processing the PML schema to
load root
and type
declarations from an external PML schema file specified in the attribute schema
to the current PML schema.
The way in which the declarations are to be combined is
described in Section 7, “Processing modular PML schemas”.
Attributes
type
Name of a specific type to import from the external PML schema file.
schema
A filename or URL of the imported external PML schema file.
revision
Constrains revision
of the imported schema to the specific value.
See Section 8, “Numbering revisions of PML schemas” for information on comparing revision numbers.
If this attribute is present, then minimal_revision
and
maximal_revision
attributes should be absent. (optional)
minimal_revision
Constrains the revision
of the imported schema to
revision numbers larger or equal to the one specified.
See Section 8, “Numbering revisions of PML schemas” for information on comparing revision numbers.
If this attribute present,
then revision
attribute should be absent. (optional)
maximal_revision
Constrains revision
of the imported schema to
revisions numbers smaller or equal to the one specified.
See Section 8, “Numbering revisions of PML schemas” for information on comparing revision numbers.
If this attribute present,
then revision
attribute should be absent. (optional)
Content.
EMPTY
derive
This element instructs an application processing the PML
schema to create a new type declaration by extending or modifying an
existing base declaration
specified by the type
attribute. The
newly created type declaration is called the
derived declaration.
The base declaration may either be one explicitely given
by a type
element, or
a previously derived one. The base declaration must declare
one of the following types:
structure
,
sequence
,
container
, or
choice
.
The element derive
must contain exactly
one of the following subelements: structure
, sequence
, choice
,
container
, corresponding
to the type of the base declaration.
In the context of the derive
element,
the content and semantics of either of the above listed subelements
differs from what is defined elsewhere in this
specification in the following way:
each
member
,
element
,
attribute
, or
value
subelement either replaces a
member, element, attribute, or value with the same name (or content in case of value
)
from the base declaration, if such a one exists,
or adds a new member, element, attribute, or value
declaration to the derived
structure, sequence, container, or choice declaration (respectively).
additionally, the subelement may
contain zero or more delete
instructions
each specifying a member, element, attribute, or value
of the base declaration to be omitted from the derived declaration.
See Section 7, “Processing modular PML schemas” for detailed instructions on processing
the derive
instruction.
Attributes
type
A name of the base type declaration (required)
name
A name for the derived type declaration (optional). If not specified, the derived declaration replaces the base declaration (this feature should be used with care and, advisably, only for base declarations imported from external PML schemas).
delete
This instruction can only occur
in a structure
, sequence
, choice
, or
container
subelement of a
derive
element
(and is therefore not included in the specifications of the content
of these individual elements).
The content is a name of a
member
,
element
,
attribute
, or
value
of a base declaration to be omitted from the derived declaration;
see derive
and
Section 7, “Processing modular PML schemas” for processing details.
Content.
#PCDATA
root
Declaration of the root element of a PML instance.
Attributes
name
The name of the element (required)
type
declares that the root-element's content is a
construct of a given named type. This attribute is complementary to
content, i.e. if this attribute is present, then
root
must be an
empty element. The named type this attribute refers to
must follow the content pattern specified below.
Moreover, if the root element's content is declared as
a container
, then
the container content type may only be a
sequence
.
type
Declaration a named type. Named types are referred to
from other elements using the attribute type
. A named type may only be
referred from contexts where the actual type represented
by the named type is allowed. In other words,
if an element in a PML schema refers to a named type,
then the content of the named type definition must be
also a valid content for the referring element.
Attributes
name
The name of the new named type (required)
role
The PML role of constructs of the type (optional)
Content.
(alt | list | choice | constant | structure | container | sequence | cdata)
structure
Declares a complex type which is a structure
with the specified members. Its content consists of one
or more member
elements defining members of the structure.
Attributes
name
An optional name of the type. This name is not used in the PML schema, but may be used by applications, e.g. when presenting constructs of the type to the user. (optional)
role
The PML role of the constructs of the type (optional)
Content.
(member)+
member
Declares a member of a structure.
The attribute name
defines
the name of the member. The type of the member's value
is specified either by the content or using the type
attribute.
It is an error if a structure declaration contains two member declarations
with the same name.
Attributes
name
Name of the member (required)
required
value 1
declares the
member as required, value 0
declares the member as optional (default is 0)
role
PML role of the member's value (optional)
as_attribute
value 1
declares that the
member is in XML realized as an attribute of the
element realizing the structure. In that case, the
value type must be atomic. Value
0
declares that the member is
realized as an XML element whose content realizes
the value construct. In the latter case case no
restrictions are put on the value type
(default is 0)
type
declares that the value type is the given named type (if this attribute is present, the element must be empty)
Content.
(alt | list | choice | constant | structure | container | sequence | cdata)
list
Defines a complex type as a list of constructs of a
given type. The content defines the type of the list members
(unless a named type is specified in the type
attribute).
Attributes
ordered
value 1
declares an
ordered list, value 0
declares an unordered list (required)
type
declares that the constructs contained in the list are of a given named type (complementary to content)
role
PML-role of constructs of the type - currently only roles
#KNIT
and #CHILDNODES
may be used with lists (optional)
Content.
(alt | choice | constant | structure | container |sequence | cdata)
alt
Defines a type which is an alternative of constructs of a
given type. The content defines a type of the alternative members
(unless a named type is specified in the type
attribute).
Attributes
type
declares that the constructs contained in the list are of a given named type (complementary to content)
Content.
(list | choice | constant | structure | container | sequence | cdata)
choice
Defines an enumerated type with a set of
possible values specified in the value
sub-elements.
Content.
(value)+
value
The text content of this element is one of the values of an enumerated type.
Content.
PCDATA
cdata
Defines an atomic type.
Constructs of atomic types are represented
in XML as text or attribute values. The atomic type is
further specified using the format
attribute which can have one of the values listed in Section 3, “Atomic data formats”.
Content.
EMPTY
constant
Defines an atomic type with a constant value specified in the content.
Content.
PCDATA
sequence
Defines a data type representing ordered sequences of zero or more constituents. Each constituent is either a string of text or a named element whose content data type is uniquely determined by the element's name. The declaration of a sequence
specifies elements which can occur in the sequence, uniquely mapping element names to data types,
indicates if text constituents are allowed to occur in the sequence (sequences permitting text constituents are called mixed-content sequences),
and, optionally, provides a simple regular-expression-like pattern describing all admissible orderings of constituents (element and interleaved text) in the sequence
Two text constituents in a mixed-content sequences should never be adjacent, i.e. there must always be an element occurring between every two text constituents.
Attributes
role
PML role of constructs of the type (optional)
content_pattern
This attribute constraints the order in which the constituents are allowed to appear in the sequence by means of an expression called content pattern (very similar and conceptually equivalent to the grammar of the element content model declaration in DTD and also similar to the syntax used in the productions in this specification).
The content pattern is built on content particles (cp
's), which consist of constituent specifiers, choice lists of content particles, or sequence lists of content particles.
The syntax of a content pattern is given by the following grammar production rules:
pattern ::= ( choice | seq | cp ) cnst ::= ( Element-name | '#TEXT' ) cp ::= ( cnst | '(' choice ')' | '(' seq ')' ) quantifier? quantifier ::= ( '?' | '*' | '+' )? choice ::= WS? cp ( WS? '|' WS? cp )+ WS? seq ::= WS? cp ( WS? ',' WS? cp )* WS? WS ::= (#x20 | #x9 | #xD | #xA)+
where pattern
is the content pattern;
a content particle (cp
) represents one or more constituents,
which may appear in the sequence on a position in which
the content particle appears in the pattern;
Element-name
is a name of an element
constituent (see element) and represents any element constituent with this name; the string '#TEXT'
represents a text constituent;
any of the content particles occurring in a
choice
group may appear
in the sequence at a position in which the
choice
group appears in the pattern;
content particles occurring in a seq
group
must each appear in the sequence in the order
in which it is listed in the group;
optional quantifier
character
following a content particle governs whether the content it represents may occur one or more (+
), zero or more (*
), or zero or one times (?
). The absence of a quantifier means that the content specified by the content particle must appear exactly once;
content particles, brackets, commas, etc. may be optionally separated by white-space
(WS
).
A sequence matches a content pattern if and only if there is a path through the content pattern, obeying the sequence, choice, and quantifier operators and matching each constituent in the sequence against a cnst
production (Element-name
or '#TEXT'
). It is an error if the pattern allows two adjacent text constituents.
For compatibility with some SGML based content model implementations, it is advisable (but not enforced) to avoid non-deterministic (1-ambiguous) content patterns such as (a,b)*,a?
(see e.g. Appendix E Deterministic Content Models (Non-Normative) to the XML 1.0 specification and the pointers therein). In particular, a constituent should not match more than one
occurrence of a cnst
production in the content pattern.
text
This element can be used at the beginning
of the sequence
element
to indicate that the sequence is of mixed-content.
In that case, every (maximal) contiguous character content
(including white-space) occurring within the XML element representing
the sequence is treated as a constituent of the sequence.
Content. EMPTY
element
Declares an element constituent of a sequence.
The attribute name
specifies its name and
either the content or the type
attribute defines the value type.
It is an error if a sequence
declaration contains two elements
with the same name.
Attributes
name
name of the element (required)
role
PML role of the construct (optional)
type
declares that the element's content is a construct of a given named type (complementary to content)
Content.
(alt | list | choice | constant | structure | container | sequence | cdata)
container
Declares a container type. A container consists of a content value accompanied
by an annotation provided by a set of name-value pairs with atomic values called attributes.
The declaration consists of zero or more
attribute
declarations, followed
by the content type declaration. The content can be of any type except
for container and structure. Containers with empty content
(indicated by absence of a content type declaration) are permitted.
Attributes
role
PML role of the construct (optional)
type
declares that the content is a construct of a given named type (complementary to content)
Content.
attribute* (alt | list | choice | constant | sequence | cdata)?
attribute
Defines an attribute of a container
.
The content defines the type of attribute's value.
Attributes
name
name of the attribute (required)
required
value 1
declares the
attribute as required, i.e. one that must be present on its container; value 0
declares the attribute as optional (defaults to 0 - optional)
role
defines a PML role of the attribute (optional)
type
defines the type of the attribute value as
a given named type. The named type must be atomic. (The type
attribute is complementary to content.)
A simplified PML schema is one
which does not contain any import
and derive
.
Simplified PML schemas are thus self-contained.
This section describes how to process a PML schema containing
import
and derive
instructions in order to obtain a simplified PML schema
semantically equivalent to the original PML schema
(we call two PML schemas semantically equivalent
if they describe the same class of instances,
mapping same data to same data types
and identifying these types with the same PML roles).
We describe the process of simplification of a PML schema by means of modifications to the original PML schema, although a particular implementation might choose a different processing strategy. See Section 11, “Tools” for a pointer to a reference implementation of this process.
A PML schema processor must first
process all import
instructions
in the order in which they appear in the PML schema
and then process the derive
instructions.
We call current schema the PML schema containing the import
element in turn, and imported schema
the PML schema referred to by the attribute
schema
of the import
element. The processing of the import
instruction
differs depending on the presence of the type
attribute.
If the type
attribute is present, the element is processed as follows:
If the current schema contains a named type
declaration
with the attribute name
equal to the value of the
type
attribute of the import
element, then the processing of the
import
element stops
and it is removed from the current schema
(this includes cases when the type declaration was added to the current
schema during processing of any preceding import
elements).
Otherwise, the imported schema is read from the
file specified by a path (absolute or relative to the location
of the file containing the current schema) or an URL contained in the schema
attribute
of the element import
.
The imported schema is parsed and
its revision
number
is mached against revision
or
minimal_revision
and maximal_revision
attributes of the import
element (if any of them is present).
More specifically, if revision
attribute of import
is present then the imported schema revision must be equal to it.
If minimal_revision
attribute of import
is present then the imported schema revision must be greater or equal to it.
If maximal_revision
attribute of import
is present then the imported schema revision must be less or equal to it.
It is an error if the revision of the imported schema does not match these constraints.
The details of revision numbering and comparison of revision numbers
are given in Section 8, “Numbering revisions of PML schemas”.
The imported schema is processed according to these instructions
into a simplified PML schema. (It is an error if two or more PML schemas refer among
themselves via import
elements in a way that forms a cycle
or if a PML schema refers via the import
element to itself).
A type
declaration
with name
attribute equal to the
type
attribute of the import
element
is located in the imported schema and copied to the current schema.
It is an error if such a declaration cannot be found in the imported schema.
Every named type referred to
by a type
attribute from any element
occurring within the copied declaration is also copied
from the imported schema to the current schema,
unless a type
declaration
with the same name
already exists in the current schema.
This step is repeated as long as there are
copied type declarations referring to declarations
in the imported schema for which there is no type
declaration
in the current schema with the same name
(either a copied or an original one). In other words,
after copying the first type
declaration,
other type
declarations may be copied
to the current schema so that all references to named types
are satisfied.
Finally, the import
element is removed from the current schema.
If the attribute type
of the import
element
is absent, the instruction is processed as follows:
The imported schema is read from the
file specified by the schema
attribute
of the import
element, parsed and processed
just as in the prior case.
If the current schema does not contain
root
declaration
and there is a root
declaration
in the imported schema, it is copied to the current schema.
Every type
declaration
is copied from the imported schema to the current schema,
unless there already is a type
declaration
with the same name
in the current schema.
The import
element is removed from the current schema.
The derive
instructions cannot
be processed if the schema contains any non-processed
import
instructions.
The derive
element
has an attribute type
referring
to a named type
declaration which will be called
the base declaration.
It is an error if the PML schema (after all preceding
derive
instructions have been processed)
does not contain a corresponding base declaration,
i.e. a declaration whose attribute name
equals
to the attribute type
of the
derive
element.
If the derive
element contains an attribute name
specifying a target declaration name,
the base declaration is copied to the PML schema
as a new type
declaration
under the target declaration name.
We refer to this copy as target declaration.
It is an error if prior to creating the target declaration
the PML schema already contained
a named declaration with the same name,
except for the case when the target declaration name
is the same the name of the base declaration.
In the latter case, or if
name
attribute
of the derive
element
is absent, the target declaration is the base declaration.
The derive
element
and the target declaration must contain the same subelement,
which is one of
structure
,
sequence
,
container
, or
choice
.
We refer to the subelement of the derive
element as template
and to the subelement of the type
element representing the target declaration as target.
For each attribute of the template with a non-empty value the corresponding attribute on the target is added or if already present, its value is changed to match the value on the template. If an attribute is present on both the template and target but its value on the template is empty, the attribute on the target is removed from the target element. All other attributes of the target are left unchanged.
If the template (and hence also the target) is a structure
element, all member
subelements of the template are copied
into the target, unless
the target structure
already contains a member
subelement with the same name, in which case
the member
subelement
from the template replaces the corresponding
subelement of the target structure
.
Then, for every delete
subelement of the template,
the member
subelement
of the target structure
whose name
attribute equals to the content
of the delete
subelement
is removed from the target structure
.
It is an error if the template contains a delete
for which there is no matching member
subelement
in the target structure
.
The processing of template and target elements which are a sequence
,
a container
, or a choice
is defined analogously, replacing
structure
and member
by
sequence
and element
,
container
and choice
, or
choice
and value
,
respectively, and in the case of a choice
,
using value
element's content
instead of the name
attribute.
The derive
element is removed from the PML schema
after the template has been processed as described above.
For maintenance and modularization purposes it is advisible that
every revision of a PML schema which adds or modifies type
declarations is assigned a unique revision number. For this
purpose, PML provides the element revision
of the PML schema. A modular
PML schema using the import
instruction to import types from another PML schema may specify
constraints on the revision number of the imported schema. This
section defines the format for PML schema revision numbers and
revision number comparison method
(implying a total order on revision numbers).
Consequent revisions of a single PML schema file must be numbered in a non-decreasing
order.
Revision numbers should be strings constisting of one or more interleaved
non-negative integer numbers and the character '.
',
starting and ending with a number.
For example, 12
, 0.2.223
,
and 12.23.1.2.2
are all valid revision numbers,
whereas .3
, -3
,1.2.
,
or 74..23
are not.
We now describe comparizon of two revision numbers.
Let
R
=
and
r
1.r
2.….r
nS=
be two revision numbers, where
s
1.s
2.….s
k
(for r
ii=1,…,n
)
and
(for s
jj=1,…,k
) are non-negative integers and let n
be less or equal to k
.
Define
for every
r
i=0i>n
. Then R=S
if and only if
for every r
i=s
ii=1,…,k
;
R<S
if and only if
or there is some r
1<s
1j<k
such that
for every r
i=s
ii=1,…,j
and
;
otherwise r
j+1=s
j+1R>S
.
For example, 1.0.0=1
, 2.1.3.8<2.1.12.8
, and
2>1.9.8
.
While it is likely that in the future PML will offer other kinds of references, such as XPointer, currently PML only defines syntax and semantics for simple ID-based references to PML structure, element or sequence constructs occurring either in the same or some other PML instance, and to XML elements of non-PML XML documents in general. Also, there is no syntax defined yet for references to non-XML resources or to constructs without an ID.
A reference to a construct occurring within the same PML instance is
represented by the ID of the referred construct (see more specific
definition below). A reference to an object occurring
outside the PML instance is represented by a string
formatted according to PMLREF
format,
i.e., a string consisting of a pair of identifiers separated by the
#
character. The first of the two identifiers
is an ID
associated in the header of the PML instance with the system file name or
URL of the instance containing the referred object. The second of the
identifiers is a unique ID of the construct (or element) within the PML
(or XML) instance it occurs in.
If the referred construct is a structure,
then its ID is the value of its
member with the role #ID
.
If the referred construct is an element,
then its ID is the value of its
attribute with the role #ID
.
If the referred construct is an XML element in a
non-PML XML document, then its ID is the value of its
ID-attribute (e.g. either the attribute xml:id
or
some other attribute declared as ID in the document's DTD or schema).
PML references are suitable for stacking one layer of linguistic annotation upon another. For this purpose, the original text is usually transformed to a very simple PML instance that only adds the most essential features such as basic tokenization, identifiers of individual tokens, etc., providing the basis upon which further annotations could be stacked. If it is not possible or desirable to directly include tokens from the original text in such a base layer, then a suitable mechanism (currently not defined by PML) has to be employed in order to carry unambiguous references to the corresponding portions of the original text (regardless of the original format).
A specific PML schema is usually defined for each of the annotation
layers. The relation between annotation layers is typically
expressed on the instance level using PML references and on the
PML schema level using the instance binding (PML schema element reference
).
The XSLT stylesheet pml2rng.xsl transforms a PML schema to the corresponding Relax NG schema that can be used for validating instances of the PML schema. The resulting Relax NG refers to a portion of Relax NG common to all PML applications which is stored in the file pml_common.rng.
There are many standard freely available tools that can be used to validate an XML document against a Relax NG, such as jing or xmllint.
A Perl script pml_simplify implements a conversion from a modular PML schema to a simplified PML schema described in Section 7, “Processing modular PML schemas”.
The Tree Editor TrEd has built-in support for PML representation of dependency and constituency trees (see Section PMLBackend in TrEd User's Manual for details).
PML instances may also be processed using conventional XML-oriented tools without direct support for PML. One of them worth recommending is XSH, which is a versatile tool for XML processing.
In this appendix we provide a Relax NG schema for PML
Schema files (it is a listing of the file pml_schema.rng
).
Note that this Relax NG schema is rather simplistic and that
does not currently reflect all constraints implied on the syntax
of the PML schema file expressed in this document. In particular,
the Relax NG does not enforce
constraints on applicability of roles nor the requirement
that a named type may only be referred to in contexts where
the actual type represented by the named type is permitted.
<?xml version="1.0"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:s="http://ufal.mff.cuni.cz/pdt/pml/schema/" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <a:documentation>PML schema syntax (revision 1.1.1)</a:documentation> <start> <element name="s:pml_schema"> <attribute name="version"> <choice><value>1.1</value></choice> </attribute> <optional> <element name="s:revision"> <text/> </element> </optional> <optional> <element name="s:description"> <text/> </element> </optional> <zeroOrMore> <ref name="reference.element"/> </zeroOrMore> <zeroOrMore> <ref name="import.element"/> </zeroOrMore> <zeroOrMore> <ref name="derive.element"/> </zeroOrMore> <optional> <ref name="root.element"/> </optional> <zeroOrMore> <ref name="type.element"/> </zeroOrMore> </element> </start> <define name="import.element"> <element name="s:import"> <a:documentation> instruction to import type(s) and root element from another PML schema </a:documentation> <attribute name="schema"/> <optional> <attribute name="type"> <data type="ID"/> </attribute> </optional> <optional> <choice> <attribute name="revision"/> <group> <optional> <attribute name="minimal_revision"/> </optional> <optional> <attribute name="maximal_revision"/> </optional> </group> </choice> </optional> </element> </define> <define name="derive.element"> <element name="s:derive"> <a:documentation> instruction to derive new type from an existing one (including one that is imported) </a:documentation> <attribute name="type"/> <optional> <attribute name="name"> <data type="ID"/> </attribute> </optional> <choice> <element name="s:structure"> <optional> <attribute name="name"/> </optional> <optional> <ref name="role.attribute"/> </optional> <zeroOrMore> <ref name="member.element"/> </zeroOrMore> <zeroOrMore> <ref name="delete.element"/> </zeroOrMore> </element> <element name="s:sequence"> <optional><attribute name="content_pattern"/></optional> <optional><ref name="role.attribute"/></optional> <zeroOrMore> <ref name="element.element"/> </zeroOrMore> <zeroOrMore> <ref name="delete.element"/> </zeroOrMore> </element> <element name="s:container"> <optional><ref name="role.attribute"/></optional> <zeroOrMore> <ref name="attribute.element"/> </zeroOrMore> <zeroOrMore> <ref name="delete.element"/> </zeroOrMore> </element> <element name="s:choice"> <zeroOrMore> <element name="s:value"> <text/> </element> </zeroOrMore> <zeroOrMore> <ref name="delete.element"/> </zeroOrMore> </element> </choice> </element> </define> <define name="delete.element"> <element name="s:delete"> <a:documentation>delete instruction in derive element (depending on the context, contains name of a structure member, sequence element, or choice value)</a:documentation> <text/> </element> </define> <define name="type.decl"> <choice> <ref name="type.attribute"/> <ref name="any.type"/> </choice> </define> <define name="any.type"> <choice> <ref name="alt.element"/> <ref name="list.element"/> <ref name="data.type"/> </choice> </define> <define name="data.type"> <choice> <ref name="choice.element"/> <ref name="constant.element"/> <ref name="structure.element"/> <ref name="sequence.element"/> <ref name="container.element"/> <ref name="cdata.element"/> </choice> </define> <define name="non-structure.type"> <choice> <ref name="alt.element"/> <ref name="list.element"/> <ref name="choice.element"/> <ref name="constant.element"/> <ref name="sequence.element"/> <ref name="cdata.element"/> </choice> </define> <define name="role.attribute"> <attribute name="role"> <a:documentation>PML role of the value</a:documentation> <choice> <value>#TREES</value> <value>#NODE</value> <value>#ORDER</value> <value>#CHILDNODES</value> <value>#ID</value> <value>#KNIT</value> <value>#HIDE</value> </choice> </attribute> </define> <define name="root.element"> <a:documentation>declaration of the root-element of a PML instance (except for the implicit obligatory <head>)</a:documentation> <element name="s:root"> <attribute name="name"/> <choice> <ref name="type.attribute"/> <ref name="sequence.element"/> <ref name="structure.element"/> <ref name="container.element"/> </choice> </element> </define> <define name="reference.element"> <element name="s:reference"> <a:documentation>declare a bound instance and optinally provide a hint for applications on how to parse it</a:documentation> <attribute name="name"/> <optional> <attribute name="readas"> <choice> <value>trees</value> <value>dom</value> </choice> </attribute> </optional> </element> </define> <define name="type.element"> <element name="s:type"> <a:documentation>a named complex type</a:documentation> <attribute name="name"> <data type="ID"/> </attribute> <optional> <ref name="role.attribute"/> </optional> <ref name="any.type"/> </element> </define> <define name="type.attribute"> <attribute name="type"> <a:documentation>a reference to a named complex type</a:documentation> <data type="IDREF"/> </attribute> </define> <define name="structure.element"> <element name="s:structure"> <a:documentation>a structure (AVS)</a:documentation> <optional> <attribute name="name"/> </optional> <optional> <ref name="role.attribute"/> </optional> <oneOrMore> <ref name="member.element"/> </oneOrMore> </element> </define> <define name="member.element"> <element name="s:member"> <a:documentation>a member of a structure</a:documentation> <optional><ref name="required.attribute"/></optional> <optional> <attribute name="as_attribute"> <choice> <value>0</value> <value>1</value> </choice> </attribute> </optional> <optional> <ref name="role.attribute"/> </optional> <attribute name="name"/> <ref name="type.decl"/> </element> </define> <define name="alt.element"> <element name="s:alt"> <a:documentation>an alternative of values of the same type</a:documentation> <choice> <ref name="type.attribute"/> <ref name="list.element"/> <ref name="data.type"/> </choice> </element> </define> <define name="list.element"> <element name="s:list"> <a:documentation>a list of values of the same type</a:documentation> <attribute name="ordered"> <choice> <value>1</value> <value>0</value> </choice> </attribute> <choice> <group> <attribute name="role"> <value>#KNIT</value> </attribute> <attribute name="type"> <a:documentation>a reference to a named complex type for knitting</a:documentation> <data type="IDREF"/> </attribute> <ref name="cdata.element"/> </group> <group> <optional> <ref name="role.attribute"/> </optional> <choice> <ref name="type.attribute"/> <ref name="alt.element"/> <ref name="data.type"/> </choice> </group> </choice> </element> </define> <define name="choice.element"> <element name="s:choice"> <a:documentation>enumerated type (atomic)</a:documentation> <oneOrMore> <element name="s:value"> <text/> </element> </oneOrMore> </element> </define> <define name="cdata.element"> <element name="s:cdata"> <a:documentation>cdata type (atomic)</a:documentation> <attribute name="format"> <choice> <value>any</value> <value>ID</value> <value>PMLREF</value> <value>string</value> <value>normalizedString</value> <value>token</value> <value>base64Binary</value> <value>hexBinary</value> <value>integer</value> <value>positiveInteger</value> <value>negativeInteger</value> <value>nonNegativeInteger</value> <value>nonPositiveInteger</value> <value>long</value> <value>unsignedLong</value> <value>int</value> <value>unsignedInt</value> <value>short</value> <value>unsignedShort</value> <value>byte</value> <value>unsignedByte</value> <value>decimal</value> <value>float</value> <value>double</value> <value>boolean</value> <value>duration</value> <value>dateTime</value> <value>date</value> <value>time</value> <value>gYear</value> <value>gYearMonth</value> <value>gMonth</value> <value>gMonthDay</value> <value>gDay</value> <value>Name</value> <value>NCName</value> <value>anyURI</value> <value>language</value> <value>IDREF</value> <value>IDREFS</value> <value>NMTOKEN</value> <value>NMTOKENS</value> </choice> </attribute> </element> </define> <define name="constant.element"> <element name="s:constant"> <a:documentation>a constant (atomic)</a:documentation> <text/> </element> </define> <define name="sequence.element"> <element name="s:sequence"> <a:documentation>a sequence of elements</a:documentation> <optional><ref name="role.attribute"/></optional> <optional><attribute name="content_pattern"/></optional> <choice> <oneOrMore> <ref name="element.element"/> </oneOrMore> <group> <interleave> <ref name="text.element"/> <oneOrMore> <ref name="element.element"/> </oneOrMore> </interleave> </group> </choice> </element> </define> <define name="text.element"> <element name="s:text"> <a:documentation>declare cdata for mixed-content sequence</a:documentation> <empty/> </element> </define> <define name="element.element"> <element name="s:element"> <a:documentation>an element of a sequence</a:documentation> <attribute name="name"/> <optional> <ref name="role.attribute"/> </optional> <ref name="type.decl"/> </element> </define> <define name="container.element"> <element name="s:container"> <a:documentation>a simple container type</a:documentation> <optional> <ref name="role.attribute"/> </optional> <zeroOrMore> <ref name="attribute.element"/> </zeroOrMore> <optional> <ref name="non-structure.type"/> </optional> </element> </define> <define name="attribute.element"> <element name="s:attribute"> <a:documentation>attribute of a container</a:documentation> <optional> <attribute name="required"> <choice> <value>0</value> <value>1</value> </choice> </attribute> </optional> <attribute name="name"/> <optional> <ref name="role.attribute"/> </optional> <choice> <ref name="type.attribute"/> <choice> <ref name="choice.element"/> <ref name="cdata.element"/> </choice> </choice> </element> </define> <define name="required.attribute"> <attribute name="required"> <choice> <value>0</value> <value>1</value> </choice> </attribute> </define> </grammar>
In this appendix we provide some simple examples of PML usage. Rather than on practical applicability of the schemas that follow, we concentrate on demonstrating the features, definitions and representation of various constructs. We also show how PML references work and how annotation layers can be stacked one upon another.
The following PML schema and instance show an application of PML to
a very simple analytical dependency annotation of English sentences. In
this example, the annotation consists of some meta data (annotator's
name and a time stamp) and a list of trees. Each tree is represented
by its root-node. Nodes are structures with two members bearing the
node-labels (word form and its syntactical function) and two technical
members (index of the node in the ordering of the tree - represented by
the attribute ord
, and a list of child-nodes -
represented by the element governs
). Note that if a
list of child-nodes has only one member, then this single child-node
may be directly represented by the governs
element.
This eliminates the need for an extra LM
bracketing
element. Note that PML doesn't actually distinguish between dependency
trees and constituency trees, but since dependency
trees are ordered trees and are not necessarily projective, we
have to employ an extra member ord
with PML-role
#ORDER
for the tree ordering.
Because we do not want any linguistic complexity to distract
the reader's attention from the technical aspects of how data
are defined and represented in PML, we have chosen two shamelessly simple sentences.
Example 1. PML schema (example1_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description>Example of dependency tree annotation</description> <root name="annotation"> <structure> <member name="meta" type="meta.type"/> <member name="trees" role="#TREES" required="1"> <list type="node.type" ordered="1"/> </member> </structure> </root> <type name="meta.type"> <structure> <member name="annotator"><cdata format="any"/></member> <member name="datetime"><cdata format="any"/></member> </structure> </type> <type name="node.type"> <structure role="#NODE"> <member name="ord" as_attribute="1" required="1" role="#ORDER"> <cdata format="nonNegativeInteger"/> </member> <member name="func" type="func.type" required="1"/> <member name="form" required="1"> <cdata format="any"/> </member> <member name="governs" role="#CHILDNODES" required="0"> <list type="node.type" ordered="0"/> </member> </structure> </type> <type name="func.type"> <choice> <value>Pred</value> <value>Subj</value> <value>Obj</value> <value>Attrib</value> <value>Adv</value> </choice> </type> </pml_schema>
Example 2. Sample instance with the annotation of the sentence:
`John loves Mary. He told her this Friday.'
(example1.xml
)
<?xml version="1.0"?> <annotation xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head> <schema href="example1_schema.xml"/> </head> <meta> <annotator>Jan Novak</annotator> <datetime>Sun May 1 18:56:55 2005</datetime> </meta> <trees> <LM ord="2"> <func>Pred</func> <form>loves</form> <governs> <LM ord="1"> <func>Subj</func> <form>John</form> </LM> <LM ord="3"> <func>Obj</func> <form>Mary</form> </LM> </governs> </LM> <LM ord="2"> <func>Pred</func> <form>told</form> <governs> <LM ord="1"> <func>Subj</func> <form>He</form> </LM> <LM ord="3"> <func>Obj</func> <form>her</form> </LM> <LM ord="5"> <func>Adv</func> <form>Friday</form> <governs ord="4"> <!-- ditto --> <func>Attrib</func> <form>this</form> </governs> </LM> </governs> </LM> </trees> </annotation>
On two simple (and of course incomplete) examples we demonstrate how Penn-treebank-like constituency trees might be represented in PML. This situation differs from the dependency trees in two aspects: 1) with constituency trees we do not have to consider an external ordering of the nodes in the tree, 2) constituency trees usually distinguish between leaf nodes (terminal nodes) and branching nodes (non-terminal nodes). In the first sample we deal with this by declaring two node types and using sequences instead of lists (as lists would require all members to be of the same type). In the second sample we provide a minimalist approach taking advantage of the fact that a non-terminal node has at most one terminal child, which in turn eliminates the need to represent leaf nodes as nodes at all. This, in combination with the possibility to reuse element names for the actual labels, provides a very compact XML notation very close to the labeled-bracket syntax of Penn Treebank.
Example 3. PML schema (example2_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description>Example of constituency tree annotation</description> <root name="annotation"> <sequence role="#TREES" content_pattern="meta, nt+"> <element name="meta"> <structure> <member name="annotator"><cdata format="any"/></member> <member name="datetime"><cdata format="any"/></member> </structure> </element> <element name="nt" type="nonterminal.type"/> </sequence> </root> <type name="nonterminal.type"> <container role="#NODE"> <attribute name="label"> <choice> <value>S</value> <value>VP</value> <value>NP</value> <value>PP</value> <value>ADVP</value> <!-- etc. --> </choice> </attribute> <sequence role="#CHILDNODES"> <element name="nt" type="nonterminal.type"/> <element name="form" type="terminal.type"/> </sequence> </container> </type> <type name="terminal.type"> <container role="#NODE"> <cdata format="any"/> </container> </type> </pml_schema>
Example 4. Sample instance with annotation of the sentence:
`John loves Mary. He told her this Friday.' (example2.xml
)
<?xml version="1.0"?> <annotation xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head> <schema href="example2_schema.xml"/> </head> <meta> <annotator>John Smith</annotator> <datetime>Sun May 1 18:56:55 2005</datetime> </meta> <nt label="S"> <nt label="NP"> <form>John</form> </nt> <nt label="VP"> <form>loves</form> <nt label="NP"> <form>Mary</form> </nt> </nt> </nt> <nt label="S"> <nt label="NP"> <form>He</form> </nt> <nt label="VP"> <form>told</form> <nt label="NP"><form>her</form></nt> <nt label="ADVP"><form>this Friday</form></nt> </nt> </nt> </annotation>
For brevity, we will not repeat the
meta
element in the second sample.
Example 5. PML schema (example3_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description> Example of very compact constituency tree annotation </description> <root name="annotation"> <sequence role="#TREES"> <element name="S" type="nonterminal.type"/> </sequence> </root> <type name="nonterminal.type"> <container role="#NODE"> <attribute name="form"><cdata format="any"/></attribute> <sequence role="#CHILDNODES"> <element name="VP" role="#NODE" type="nonterminal.type"/> <element name="NP" role="#NODE" type="nonterminal.type"/> <element name="PP" role="#NODE" type="nonterminal.type"/> <element name="ADVP" role="#NODE" type="nonterminal.type"/> <!-- etc. --> </sequence> </container> </type> </pml_schema>
Example 6. Sample instance (example3.xml
)
<?xml version="1.0"?> <annotation xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head><schema href="example3_schema.xml"/></head> <S> <NP form="John"/> <VP form="loves"> <NP form="Mary"/> </VP> </S> <S> <NP form="He"/> <VP form="told"> <NP form="her"/> <ADVP form="this Friday"/> </VP> </S> </annotation>
Note that once the labels of non-terminals coincide with the names of elements
representing nodes, we could apply further restrictions on
the nesting directly in the PML schema. For example, it would be very easy to
incorporate some grammar rules (such as that
ADVP
can only occur within VP
, etc.).
To demonstrate cross-referencing in a PML instance we define two simple PML schemas for representing arbitrary graph with both labeled nodes. In the first schema, we represent the graph by a list of its vertices and a list of its edges. With the second schema the same graph is represented by a list of structures for nodes consisting of a label and a a list of pointers to the nodes connected with the current node by an edge.
Example 7. PML schema (example4_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description>An oriented graph</description> <root name="graph"> <structure> <member name="verteces"> <list ordered="0"> <structure> <member name="id" as_attribute="1" required="1" role="#ID"> <cdata format="ID"/> </member> <member name="label" required="1"> <cdata format="any"/> </member> </structure> </list> </member> <member name="edges"> <list ordered="0"> <structure> <member name="from.rf" required="1" as_attribute="1"> <cdata format="PMLREF"/> </member> <member name="to.rf" required="1" as_attribute="1"> <cdata format="PMLREF"/> </member> </structure> </list> </member> </structure> </root> </pml_schema>
Example 8. Sample instance (example4.xml
)
<?xml version="1.0"?> <graph xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head><schema href="example4_schema.xml"/></head> <verteces> <LM id="v1"><label>A</label></LM> <LM id="v2"><label>B</label></LM> <LM id="v3"><label>A</label></LM> <LM id="v4"><label>C</label></LM> <LM id="v5"><label>D</label></LM> </verteces> <edges> <LM from.rf="v1" to.rf="v2"/> <LM from.rf="v1" to.rf="v3"/> <LM from.rf="v2" to.rf="v4"/> <LM from.rf="v3" to.rf="v4"/> <LM from.rf="v4" to.rf="v1"/> </edges> </graph>
Example 9. PML schema (example5_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description>An oriented graph</description> <root name="graph"> <structure> <member name="body" required="1"> <list ordered="0"> <structure> <member name="id" as_attribute="1" required="1" role="#ID"> <cdata format="ID"/> </member> <member name="label" required="1"> <cdata format="any"/> </member> <member name="edges.rf"> <list ordered="0"> <cdata format="PMLREF"/> </list> </member> </structure> </list> </member> </structure> </root> </pml_schema>
Example 10. Sample instance (example5.xml
)
<?xml version="1.0"?> <graph xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head><schema href="example5_schema.xml"/></head> <body> <LM id="v1"> <label>A</label> <edges.rf> <LM>v2</LM> <LM>v3</LM> </edges.rf> </LM> <LM id="v2"> <label>B</label> <edges.rf>v4</edges.rf> </LM> <LM id="v3"> <label>A</label> <edges.rf>v4</edges.rf> </LM> <LM id="v4"> <label>C</label> <edges.rf>v1</edges.rf> </LM> <LM id="v5"> <label>D</label> </LM> </body> </graph>
In this example we define PML schemas for two annotation layers. The first layer represents the tokenized text with the sentence-boundary markup. The second layer is a constituency-tree annotation of the sentences on the first (lower) layer. This constituency annotation is similar to the samples Section 2, “Constituency trees”, but this time the terminals contain references to the tokenized text.
The following schema and instance show a tokenization layer.
Example 11. PML schema (example6_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <revision>0.2</revision> <description>Example of tokenization layer</description> <root name="tokenization"> <structure> <member name="sentences"> <list ordered="1" type="sentence.type"/> </member> </structure> </root> <type name="sentence.type"> <structure> <member name="id" role="#ID" type="ID.type" required="1" as_attribute="1"/> <member name="tokens"> <!-- words (tokens) --> <sequence> <element name="w" type="w.type"/> </sequence> </member> </structure> </type> <type name="w.type"> <container> <attribute name="id" role="#ID" type="ID.type" required="1"/> <cdata format="any"/> </container> </type> <type name="ID.type"> <cdata format="ID"/> </type> </pml_schema>
Example 12. Sample instance (example6.xml
)
<?xml version="1.0"?> <tokenization xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head><schema href="example6_schema.xml"/></head> <sentences> <LM id="s1"> <tokens> <w id="s1w1">John</w> <w id="s1w2">loves</w> <w id="s1w3">Mary</w> <w id="s1w4">.</w> </tokens> </LM> <LM id="s2"> <tokens> <w id="s2w1">He</w> <w id="s2w2">told</w> <w id="s2w3">her</w> <w id="s2w4">this</w> <w id="s2w5">Friday</w> <w id="s2w6">.</w> </tokens> </LM> </sentences> </tokenization>
The following schema and instance show an annotation
layer stacked over the previously defined tokenization layer.
The relation between units on these layers, represented
by PML references from the annotation layer to the tokenization layer,
may in general be N to M. The references to the tokenization
layer have role #KNIT
, which indicates
that applications may replace the member w.rf
containing the list of references with the corresponding
object from the lower layer (i.e. the w
element).
Example 13. PML schema (example7_schema.xml
)
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <description> Example of tree annotation over a tokenization layer </description> <reference name="tokenization" readas="dom"/> <root name="annotation"> <sequence role="#TREES"> <element name="S"> <container role="#NODE"> <attribute name="sentence.rf"> <cdata format="PMLREF"/> </attribute> <list ordered="1" role="#CHILDNODES" type="node.type"/> </container> </element> </sequence> </root> <type name="node.type"> <structure role="#NODE"> <member as_attribute="1" name="label"> <choice> <value>S</value> <value>VP</value> <value>NP</value> <value>PP</value> <value>ADVP</value> <!-- etc. --> </choice> </member> <member name="w.rf"> <list ordered="0" role="#KNIT" type="w.type"> <cdata format="PMLREF"/> </list> </member> <member name="constituents" role="#CHILDNODES"> <list ordered="1" type="node.type"/> </member> </structure> </type> <type name="w.type"> <container> <attribute name="id" role="#ID" required="1"> <cdata format="ID"/> </attribute> <cdata format="any"/> </container> </type> </pml_schema>
Example 14. Sample instance (example7.xml
)
<?xml version="1.0"?> <annotation xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head> <schema href="example7_schema.xml"/> <references> <reffile name="tokenization" id="t" href="example6.xml"/> </references> </head> <S sentence.rf="s1"> <LM label="NP"><w.rf>t#s1w1</w.rf></LM> <LM label="VP"> <w.rf>t#s1w2</w.rf> <constituents label="NP"> <w.rf>t#s1w3</w.rf> </constituents> </LM> </S> <S sentence.rf="s2"> <LM label="NP"><w.rf>t#s2w1</w.rf></LM> <LM label="VP"> <w.rf>t#s2w2</w.rf> <constituents> <LM label="NP"><w.rf>t#s2w3</w.rf></LM> <LM label="ADVP"> <w.rf> <LM>t#s2w4</LM> <LM>t#s2w5</LM> </w.rf> </LM> </constituents> </LM> </S> </annotation>
After knitting is applied on PML references in
w.rf
, the instance appears to the application
as follows:
Example 15. Sample instance after knitting (example7_knit.xml
)
<?xml version="1.0"?> <annotation xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> <head> <schema href="example7_schema.xml"/> <references> <reffile name="tokenization" id="t" href="example6.xml"/> </references> </head> <S sentence.rf="s1"> <LM label="NP"><w id="s1w1">John</w></LM> <LM label="VP"> <w id="s1w2">loves</w> <constituents label="NP"> <w id="s1w3">Mary</w> </constituents> </LM> </S> <S sentence.rf="s2"> <LM label="NP"><w id="s2w1">He</w></LM> <LM label="VP"> <w id="s2w2">told</w> <constituents> <LM label="NP"><w id="s2w3">her</w></LM> <LM label="ADVP"> <w> <LM id="s2w4">this</LM> <LM id="s2w5">Friday</LM> </w> </LM> </constituents> </LM> </S> </annotation>
In the last schema example example7_schema.xml
,
we have repeated the type w.type
from
example6_schema.xml
. In real-world cases,
copy-paste repetitions lead to many maintainability issues and
are not a good practice. Starting from version 1.1, PML
provides support for modularization of PML schemas
via import
and derive
instructions. Here is how example7_schema.xml
schema could be rewritten using these features:
Example 16. Modular version of the schema example7_schema.xml
(example8_schema.xml
).
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <revision>0.7</revision> <description> Example of tree annotation over a tokenization layer </description> <reference name="tokenization" readas="dom"/> <import type="w.type" schema="example6_schema.xml" revision="0.2"/> <root name="annotation" type="annotation.type"/> <type name="annotation.type"> <sequence role="#TREES"> <element name="S" type="S.type"/> </sequence> </type> <type name="S.type"> <container role="#NODE"> <attribute name="sentence.rf"> <cdata format="PMLREF"/> </attribute> <list ordered="1" role="#CHILDNODES" type="node.type"/> </container> </type> <type name="node.type"> <structure role="#NODE"> <member as_attribute="1" name="label" type="label.type"/> <member name="w.rf"> <list ordered="0" role="#KNIT" type="w.type"> <cdata format="PMLREF"/> </list> </member> <member name="constituents" role="#CHILDNODES"> <list ordered="1" type="node.type"/> </member> </structure> </type> <type name="label.type"> <choice> <value>S</value> <value>VP</value> <value>NP</value> <value>PP</value> <value>ADVP</value> <!-- etc. --> </choice> </type> </pml_schema>
The modularity allows for extending the imported schema.
The following example presents a PML schema based
on example8_schema.xml
but largely extended. It also borrows
a single type from example1_schema.xml
.
Example 17. Extending PML schemas (example9_schema.xml
).
<?xml version="1.0"?> <pml_schema version="1.1" xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/"> <revision>0.1</revision> <description> Extended example of tree annotation over a tokenization layer </description> <reference name="tokenization" readas="dom"/> <import schema="example8_schema.xml" minimal_revision="0.4" maximal_revision="1.0"/> <import schema="example1_schema.xml" type="meta.type"/> <derive type="annotation.type"> <sequence content_pattern="meta, S+"> <element name="meta" type="newmeta.type"/> </sequence> </derive> <derive type="S.type"> <container> <attribute name="annotators_comment"> <cdata format="any"/> </attribute> </container> </derive> <derive type="meta.type" name="changes.type"> <structure> <member name="id" role="#ID" as_attribute="1" required="1"><cdata format="ID"/></member> <member name="desc"><cdata format="any"/></member> </structure> </derive> <derive type="label.type"> <choice> <value>SDECL</value> <value>SIMP</value> <value>SQUEST</value> <delete>S</delete> </choice> </derive> <type name="newmeta.type"> <structure> <member name="lang"><cdata format="any"/></member> <member name="changes"><list type="changes.type" ordered="1"/></member> </structure> </type> </pml_schema>
Modular PML schemas can be converted back to non-modular,
simplified, PML schemas for easier processing. The
following schema is the simplified version
of example9_schema.xml
,
automatically generated
by the tool pml_simplify
,
mentioned in Section 11, “Tools”.
Example 18. Extending PML schemas (example10_schema.xml
).
<?xml version="1.0"?> <!-- Created by pml_simplify on Fri Jun 2 14:29:25 2006 Command-line: pml_simlify examples/example9_schema.xml examples/example10_schema.xml --> <pml_schema xmlns="http://ufal.mff.cuni.cz/pdt/pml/schema/" version="1.1"> <revision>0.1</revision> <description> Extended example of tree annotation over a tokenization layer </description> <reference name="tokenization" readas="dom"/> <!-- ============ import schema="example8_schema.xml" ============ --> <root name="annotation" type="annotation.type"/> <type name="w.type"> <container> <attribute name="id" role="#ID" type="ID.type" required="1"/> <cdata format="any"/> </container> </type> <type name="ID.type"> <cdata format="ID"/> </type> <!-- ============ derived from annotation.type ============ --> <type name="annotation.type"> <sequence role="#TREES" content_pattern="meta, S+"> <element name="S" type="S.type"/> <element name="meta" type="newmeta.type"/> </sequence> </type> <!-- ============ derived from S.type ============ --> <type name="S.type"> <container role="#NODE"> <attribute name="annotators_comment"> <cdata format="any"/> </attribute> <attribute name="sentence.rf"> <cdata format="PMLREF"/> </attribute> <list ordered="1" role="#CHILDNODES" type="node.type"/> </container> </type> <type name="node.type"> <structure role="#NODE"> <member as_attribute="1" name="label" type="label.type"/> <member name="w.rf"> <list ordered="0" role="#KNIT" type="w.type"> <cdata format="PMLREF"/> </list> </member> <member name="constituents" role="#CHILDNODES"> <list ordered="1" type="node.type"/> </member> </structure> </type> <!-- ============ derived from label.type ============ --> <type name="label.type"> <choice> <value>VP</value> <value>NP</value> <value>PP</value> <value>ADVP</value> <!-- etc. --> <value>SDECL</value> <value>SIMP</value> <value>SQUEST</value> </choice> </type> <!-- ============================================================= --> <!-- ============ import schema="example1_schema.xml" type="meta.type"============ --> <type name="meta.type"> <structure> <member name="annotator"> <cdata format="any"/> </member> <member name="datetime"> <cdata format="any"/> </member> </structure> </type> <!-- ============================================================================== --> <type name="newmeta.type"> <structure> <member name="lang"> <cdata format="any"/> </member> <member name="changes"> <list type="changes.type" ordered="1"/> </member> </structure> </type> <!-- ============ derived from meta.type ============ --> <type name="changes.type"> <structure> <member name="annotator"> <cdata format="any"/> </member> <member name="datetime"> <cdata format="any"/> </member> <member name="id" role="#ID" as_attribute="1" required="1"> <cdata format="ID"/> </member> <member name="desc"> <cdata format="any"/> </member> </structure> </type> </pml_schema>