Multiword expressions in the Prague Dependency Treebank 2.0

Annotation of Multiword Expressions and Multiword Named Entities in the PDT 2.0

The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large collection of Czech texts with interlinked morphological (2 million words), syntactic (1.5 million words), and complex semantic (0.8 million words) annotation; in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.

This dataset adds annotation of multiword expressions and multiword named entities to the original PDT 2.0 data. The annotation is stand-off, stored in the same PML format as the original PDT 2.0 data. It is to be used together with the PDT 2.0.
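As a rough illustration of what "stand-off" means here, the sketch below resolves annotation records that point into a separate tree file by node identifier. It is only a minimal sketch: the element names (`node`, `mwe`, `ref`), the attribute names, and the sample identifiers are hypothetical and do not reproduce the actual PML schemas of the PDT 2.0 or of this dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-ins for a tectogrammatical file and
# a stand-off MWE file. All element and attribute names are invented for this
# sketch; the real PML files use their own schemas.
T_LAYER = """
<trees>
  <node id="t-1" lemma="pražský"/>
  <node id="t-2" lemma="hrad"/>
  <node id="t-3" lemma="stát"/>
</trees>
"""

STANDOFF = """
<annotations>
  <mwe lexicon-id="s-42">
    <ref>t-1</ref>
    <ref>t-2</ref>
  </mwe>
</annotations>
"""


def resolve_mwes(t_layer_xml, standoff_xml):
    """Return (lexicon id, [lemmas]) pairs by following node references."""
    # Index the tree nodes by id so stand-off references can be looked up.
    t_root = ET.fromstring(t_layer_xml.strip())
    lemma_by_id = {n.get("id"): n.get("lemma") for n in t_root.iter("node")}

    resolved = []
    s_root = ET.fromstring(standoff_xml.strip())
    for mwe in s_root.iter("mwe"):
        # Each reference names a node in the separate tree file.
        lemmas = [lemma_by_id[ref.text] for ref in mwe.iter("ref")]
        resolved.append((mwe.get("lexicon-id"), lemmas))
    return resolved


print(resolve_mwes(T_LAYER, STANDOFF))
```

The point of the design is that the tree file is never modified: the MWE layer only stores references into it, which is why the annotation cannot be interpreted without the PDT 2.0 data.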

The dataset also includes SemLex, a tectogrammatical lexicon of all the annotated MWEs. The lexicon is a work in progress. It is complete in terms of coverage of the data, and every entry includes the basic form of the expression, a simplified dependency structure, and some other attributes. However, only a few entries so far have a proper gloss, an example sentence, synonyms (if applicable), and the remaining descriptive attributes.

Authors: Pavel Straňák, Eduard Bejček
Supported by: grant 1ET201120505 of the Academy of Sciences of the Czech Republic and grant MSM0021620838 of the Ministry of Education, Youth and Sports of the Czech Republic
Status: published within PDT 2.5 (only gold data, not the parallel annotation)
Annotation (data): Download all parallel annotations (without any corrections) from three annotators. The data is in a stand-off PML format.
SemLex (lexicon of MWEs): Download
License: This work is licensed under a Creative Commons License.
PDT 2.0 itself is not part of this dataset. To use the PDT 2.0, a valid PDT License is required.
PDT 2.5 (with gold MWE annotation) is, however, licensed under CC.
Annotation Tool: SemAnn (username and password 'public')
Visualisation+Search: A TrEd extension is available: install and run TrEd, then go to Setup→Manage Extensions→Get New Extensions→"Display st-data in the tectogrammatical trees".
The development repository of the extension is also public.
Publications