This section briefly introduces the existing I/O backends and then provides a quick introduction to writing a custom backend.
IOBackend is not an I/O backend per se. It is the base class
underlying most of the other I/O backends. It provides a simple
abstraction layer over common low-level tasks, such as pipeline
redirection, gzip compression, URL resolving, and fetching and
uploading files over remote protocols. Thanks to this class,
TrEd transparently handles gzip compression and decompression of
files with the .gz extension as well as remote file transfer over
ftp://, http://, ssh:// and other protocols. Note, however, that the
availability of remote file transfer depends heavily on the particular
setup (currently only UNIX systems are fully supported) and may require
external tools such as curl, kioclient, ssh, and gzip.
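The idea of transparent decompression can be sketched in a few lines of Perl. This is only an illustration, not the code of the base class: the helper name open_maybe_gz is hypothetical, and it uses the core module IO::Uncompress::Gunzip instead of the external gzip tool mentioned above.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of TrEd): return a read handle for
# $filename, decompressing on the fly when the name ends in ".gz".
sub open_maybe_gz {
    my ($filename) = @_;
    if ($filename =~ /\.gz\z/) {
        # The returned object can be read like an ordinary filehandle.
        my $gz = IO::Uncompress::Gunzip->new($filename)
            or die "Cannot gunzip $filename: $GunzipError\n";
        return $gz;
    }
    open(my $fh, '<', $filename)
        or die "Cannot open $filename: $!\n";
    return $fh;
}
```

A caller reads from the returned handle with the usual `<$fh>` operator and does not need to know whether the file was compressed.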
FSBackend deals with a format called FS (feature structure). The FS
format was the first format supported by TrEd, and for a long time also
the only one. As a result, some of the Perl structures used by TrEd,
such as FSNode and FSFile, as well as the underlying Perl library
Fslib.pm itself, bear its name (even though they are now used equally
to represent data obtained from any other I/O backend).
The FS format provides a simple and effective way of storing trees.
In this format, each node of the tree uses the same set of attributes
declared in the FS-file header. The FS format supports string values,
enumerated values and flat lists of these (i.e. strings consisting of
a |-separated list of values of the first two types). There is no
direct support for nested AVS structures, complex lists or alternatives.
Recent versions of TrEd and Fslib provide a simple emulation of nested
AVS (attribute-value structures) for FS attribute values: an attribute
whose name contains one or more slashes is represented as a (possibly
nested) AVS in which each slash represents one level of nesting.
Attributes sharing a common name part followed by a slash are thus
represented as members of the same structure. For example, the
attributes a, b/u/x, b/v/x and b/v/y result in the following structure:
{
  a => value-of-a,
  b => { u => { x => value-of-b/u/x },
         v => { x => value-of-b/v/x,
                y => value-of-b/v/y }
       }
}
If attributes named a and a/b both exist, the nested-AVS emulation is
abandoned and all attributes are represented literally (i.e.
with slashes in their names).
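The emulation rule described above can be sketched as a small stand-alone Perl function. This is an illustration under stated assumptions, not the Fslib implementation: the function name nest_attributes is hypothetical, and the input is taken to be a plain hash mapping FS attribute names to values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Fslib): convert a flat attribute hash
# into a nested structure, one level of nesting per slash in the name.
sub nest_attributes {
    my ($flat) = @_;

    # Conflict check: if both "a" and "a/..." exist, give up on nesting
    # and return a copy with the literal (slash-containing) names.
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        pop @parts;                       # drop the final name part
        my $prefix = '';
        for my $part (@parts) {
            $prefix = length($prefix) ? "$prefix/$part" : $part;
            return { %$flat } if exists $flat->{$prefix};
        }
    }

    my %nested;
    for my $name (keys %$flat) {
        my @parts = split m{/}, $name;
        my $last  = pop @parts;
        my $node  = \%nested;
        # Walk down, creating intermediate hashes as needed.
        $node = $node->{$_} //= {} for @parts;
        $node->{$last} = $flat->{$name};
    }
    return \%nested;
}

my $avs = nest_attributes({
    'a' => 'A', 'b/u/x' => 'BUX', 'b/v/x' => 'BVX', 'b/v/y' => 'BVY',
});
# $avs->{b}{v}{y} now holds the value stored under "b/v/y".
```

Checking for conflicts before building anything keeps the fallback simple: either every attribute is nested or none is, which matches the all-or-nothing behaviour described above.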
Even with the recently added nested-AVS emulation, FS format is not in general fully capable of capturing data originating in other backends such as PML.
FS format is fully described in http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/fs.html
This backend is based on the excellent Perl module Storable, which can
store and retrieve almost any Perl data structure very quickly. Because
of its general nature, this backend can save files originating from any
other backend and keep them intact. Because it saves and retrieves data
so quickly, it is often useful to convert all data temporarily into
this format (e.g. using btred) and to revert to the original format
only after all work on the data is done.
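A minimal round trip with Storable looks roughly as follows. The data structure is only a stand-in for illustration, not TrEd's real in-memory representation, and the file suffix is arbitrary.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Stand-in data (an assumption for this sketch, not TrEd's structures).
my $data = {
    trees => [ { form => 'root', children => [ { form => 'leaf' } ] } ],
    meta  => { source_format => 'FS' },
};

my ($tmp_fh, $path) = tempfile(SUFFIX => '.stor', UNLINK => 1);
close $tmp_fh;

nstore($data, $path);          # binary dump in network byte order
my $copy = retrieve($path);    # independent deep copy read back from disk
```

Because the whole structure is dumped as is, nothing has to be mapped onto a fixed set of attributes, which is why data from any other backend survives the round trip.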
This backend provides support for the CSTS format. CSTS stands for
“Czech sentence tree structure”. Since it is an application of SGML,
the backend requires an external SGML parser (namely nsgmls) and a
document type definition (DTD) (csts.doctype). This backend represents
CSTS data as trees with a fixed set of attributes, specific to
the purpose for which CSTS was created, namely morphological and
syntactic annotation of Czech texts.
CSTS was the primary format of the Prague Dependency Treebank 1.0.
This backend is used for exchanging data between TrEd and
btred servers, or in other words to “peek”
into the memory of running btred servers.
This backend only accepts filenames (or rather URLs) starting with the
ntred:// protocol specification followed by a real file name. When
this backend opens a file, it uses the ntred client to fetch the file
from any currently running btred server. If some btred server has an
in-memory copy of the specified file (possibly edited during previous
ntred requests), it sends this copy to the client and, via
NTREDBackend, to TrEd. Saving works in the reverse direction: the ntred
client is used to communicate the (possibly edited) file back to the
btred server, which replaces its previous in-memory copy of the
file with the file obtained in this way.
Instead of requesting a whole file from the servers, it is also
possible (and sometimes faster) to request only a single tree. This can
be achieved by appending a suffix of the form @N to the ntred:// URL,
where N is the number of the requested tree (counting from 1). In that
case, TrEd obtains a file containing only the requested tree. For
example, the URL ntred://filename@3##1.4 opens a file containing the
3rd tree of filename (as represented in the memory of a btred server)
and opens that file on the 4th node of its first and only tree. URLs of
this form are produced by the TredMacro function NPosition()
(see Section 14.8, “Public API: pre-defined macros”).
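The URL form described above can be taken apart with a single regular expression. This is a sketch only: the function name parse_ntred_url is hypothetical, and the reading of the ##T.N suffix as "tree T, node N" is inferred from the example above rather than from a formal grammar.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TrEd): split an ntred:// URL into the
# file name, the optional @N tree selector and the optional ##T.N position.
sub parse_ntred_url {
    my ($url) = @_;
    my ($file, $tree, $pos_tree, $pos_node) =
        $url =~ m{\Antred://(.+?)(?:\@(\d+))?(?:\#\#(\d+)\.(\d+))?\z}
        or return;                       # not an ntred:// URL
    return {
        file     => $file,
        tree     => $tree,               # undef when the whole file is requested
        position => defined $pos_tree ? [ $pos_tree, $pos_node ] : undef,
    };
}

my $parsed = parse_ntred_url('ntred://filename@3##1.4');
# $parsed->{file} is "filename", $parsed->{tree} is 3,
# and $parsed->{position} is [1, 4].
```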
This backend provides partial support for a new generic XML-based format called PML (the Prague Markup Language), used for capturing rich linguistic annotation. This format is the primary format of the Prague Dependency Treebank 2.0. Each application of PML is described by a special XML file called a PML schema. The schema defines which elements and attributes constitute the nodes and structure of the tree, the value types of node attributes, etc.
An up-to-date version of the full PML specification can be found on the PML project page. Note, however, that PML is an ongoing project, so consider it a “work in progress”.
Because of the generic nature of PML, the PMLBackend of TrEd
is restricted to those applications of PML which satisfy the
following criteria:
There is exactly one PML sequence or PML element of a list type with
PML role #TREES, and this element appears under the root element.
If the entity with role #TREES is a PML sequence, then all its members
are elements with role #NODE. If it is a PML element of a list type,
then all members of the list are PML structures with role #NODE.
Each PML element with role #NODE may contain a sequence of its child
elements. This sequence must have PML role #CHILDNODES.
Each PML structure with role #NODE may contain a member of a list type,
which constitutes the list of child nodes. This member must have PML
role #CHILDNODES.
TrEd represents PML data types in a natural way. PML structures are
represented as AVS structures, PML lists as Fslib::List objects and
PML alternatives as Fslib::Alt objects. Other non-atomic PML types,
such as PML elements, text data in mixed content, attributes, etc.,
are represented as AVS with the following four special members:
#type (type of the entity, e.g. element, text),
#name (name of the entity),
#ns (XML namespace),
#content (content of the entity).
Attributes of a PML element are represented as additional members
of the AVS representing the element.
If the PML schema declares a reference to an external resource
and this declaration bears the attribute readas="dom", then
PMLBackend loads the corresponding resource for the PML instance
as a DOM (Document Object Model) tree (using the Perl module
XML::LibXML) and attaches this DOM tree to the application data
section of the in-memory representation of the file.
If the PML schema declares a reference to an external resource
and this declaration bears the attribute readas="trees", then
PMLBackend passes the file name of the corresponding resource to
TrEd, and TrEd loads it as an ordinary file. This file can be edited
and treated like any other file in TrEd. In btred, this file is opened
as a so-called secondary file, i.e. a file which is not implicitly
processed by the macro specified by the user; since it is loaded in
memory, however, the macro may explicitly choose to process it.
PMLBackend also supports so-called “knitting” of PML instances, i.e.
replacing certain types of PML references with the content of the
referenced entities occurring in another PML instance.
Conversely, when a PML instance to which this “knitting” has been
applied is saved, the (possibly edited) content replaces the content of
the referenced entities in their original PML instance.
Knitting only applies to:
members of PML structures containing a PML reference and having PML
role #KNIT,
members of PML structures containing a list with PML role #KNIT, with
PML references as list members.
If such a member is encountered, then a possible trailing .rf part of
its name is removed and its content (one or more PML references) is
replaced with the corresponding entities in the referenced PML
instance. It is required that all such PML references refer to
resources specified in the PML schema as readas="dom".
TrXMLBackend was intended as an XML replacement for the FS format.
Unfortunately, it was never fully developed or thoroughly tested. This
may still happen in the future, but until then it is not recommended
for serious work.
This backend reads and stores trees represented in a specific subset of the TEI XML format. This format was (is?) used in the Slovene Treebank Project.
An I/O backend is a Perl module defining at least the following five subroutines (listed in the order in which they are typically called by TrEd):
test($filename, $encoding)
This function should only quickly peek into the given file
in order to determine whether the file is suitable for the
backend. If this function accepts the file by returning a
defined non-zero value (e.g. 1), then the file is
processed by this backend. If the file is not suitable
for the backend, this function must reject the file
by returning 0 or undef, so that other backends in the list of
backends can try their luck.
open_backend($filename, $mode, $encoding)
This function should open and return a filehandle for the given
file. If $mode is r, then this filehandle should be open for reading;
if $mode is w, it should be open for writing.
The third argument, $encoding, contains the encoding specified by the
user in the defaultFileEncoding configuration option. This information
may be ignored if the data format provides another way to determine the
encoding. Most backends do not re-implement this
function, but simply import (i.e. inherit) it from the
base class IOBackend.
read($filehandle, $fsfile)
This is the key function that implements converting data
from the specific data format to the corresponding
in-memory representation in TrEd.
This function receives two arguments: the $filehandle previously
obtained by a call to the backend's open_backend, and an empty FSFile
object (i.e. one with no trees). It is supposed to parse the data
format, build a tree representation of the data (usually using
functions such as FSNode->new() and
Fslib::Paste($child,$parent,$fsfile->FS)) and populate the FSFile
with the resulting trees (e.g. using its changeTrees method).
It should also set up the FSFormat object associated with the
$fsfile ($fsfile->FS).
Any additional information related to the file (but not representable
as trees or FSFormat) may be attached to the file, e.g. using
$fsfile->changeMetaData($key,$value).
write($filehandle, $fsfile)
This function is the opposite of read. By examining the FSFile
object $fsfile (especially its trees and meta data),
it should write the corresponding representation
in the specific data format to the given $filehandle.
close_backend($filehandle)
This function should close the given filehandle
created by a previous call to open_backend. It usually only
consists of applying the Perl function close to the filehandle, but if
additional cleanup is necessary, it should be done here.
Most backends do not re-implement this function, but
simply import (i.e. inherit) it from the base class IOBackend.
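The five subroutines can be put together into a skeleton like the one below. It is a sketch under several assumptions, not a complete backend: the package name MyFormatBackend, the #MYFORMAT signature line and the my_format_lines meta-data key are all hypothetical, the return values of read and write are assumed, and read only attaches the raw lines as meta data where a real backend would build FSNode trees and pass them to changeTrees. Of the FSFile interface, only changeMetaData($key,$value) as described above is used.

```perl
package MyFormatBackend;   # hypothetical backend name
use strict;
use warnings;

my $MAGIC = '#MYFORMAT';   # hypothetical signature on the first line

# test: peek at the first line only; 1 accepts the file, 0 rejects it.
sub test {
    my ($filename, $encoding) = @_;
    open(my $fh, '<', $filename) or return 0;
    my $first = <$fh>;
    close($fh);
    return (defined $first && index($first, $MAGIC) == 0) ? 1 : 0;
}

# open_backend: $mode is 'r' or 'w'; returns a filehandle (undef on failure,
# which is an assumption of this sketch).
sub open_backend {
    my ($filename, $mode, $encoding) = @_;
    open(my $fh, ($mode eq 'w' ? '>' : '<'), $filename) or return undef;
    binmode($fh, ":encoding($encoding)")
        if defined $encoding && length $encoding;
    return $fh;
}

# read: placeholder parser. A real backend would build trees here and
# populate $fsfile with them; this sketch only stores the raw lines.
sub read {
    my ($fh, $fsfile) = @_;
    my @lines = grep { !/^\Q$MAGIC\E/ } <$fh>;
    chomp @lines;
    $fsfile->changeMetaData('my_format_lines', \@lines);
    return 1;              # assumed success indicator
}

# write: placeholder serializer; a real backend would examine the trees
# and meta data of $fsfile and print them in its own format.
sub write {
    my ($fh, $fsfile) = @_;
    print {$fh} "$MAGIC\n";
    return 1;              # assumed success indicator
}

# close_backend: nothing to clean up beyond closing the handle.
sub close_backend {
    my ($fh) = @_;
    return close($fh);
}

1;
```

In practice open_backend and close_backend would usually be inherited from IOBackend as noted above; they are spelled out here only to keep the sketch self-contained.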
There are several ways to make TrEd aware of a user-defined I/O backend, namely:
listing additional backends in the IOBackends configuration option,
listing additional backends after -B on the command line (see Section 13, “Command-line options”),
defining a get_backends_hook