Data

The Prague Discourse Treebank 4.0 is available to download from the LINDAT-Clarin repository in two data formats:

PDiT 4.0 in the PML data format, as a part of the PDT-C 2.0 (all underlying annotation layers + discourse annotation as a part of the tectogrammatical layer): http://hdl.handle.net/11234/1-5813 (see the Licence),
PDiT 4.0 in the PDTB 3.0 data format (raw texts + stand-off discourse annotation): http://hdl.handle.net/11234/1-5680 (see the Licence).

PDiT 4.0 in the Prague Markup Language data format

PDiT 4.0 in the Prague Markup Language data format (the primary format of the treebank) is a part of the Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0; Hajič et al., 2024) [get it here: http://hdl.handle.net/11234/1-5813].

The Prague Markup Language (PML; Pajas and Štěpánek, 2008) is an XML-based format for linguistic annotations (esp. treebanks). The annotation of each document in the PDT-C 2.0 is captured in four interlinked files, one per annotation layer: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz). Discourse relations are annotated on top of tectogrammatical (deep-syntax) dependency trees.
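As a rough illustration of this file layout, the four layer files of one document can be located and inspected with standard gzip and XML tools. The sketch below relies only on what is stated above (gzip-compressed XML, suffixes .w.gz, .m.gz, .a.gz, .t.gz). The helper names are hypothetical, and no particular PML element names are assumed; the code merely counts whatever elements it finds.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Layer suffixes as listed in the text above.
LAYER_SUFFIXES = {
    "word": ".w.gz",
    "morphological": ".m.gz",
    "analytical": ".a.gz",
    "tectogrammatical": ".t.gz",
}


def layer_files(directory, doc_name):
    """Map each layer name to the expected path of its file for one document."""
    base = Path(directory)
    return {layer: base / f"{doc_name}{suffix}"
            for layer, suffix in LAYER_SUFFIXES.items()}


def count_elements(path):
    """Parse a gzip-compressed XML file and count elements by local tag name."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    # Strip the XML namespace, if any: "{namespace}tag" -> "tag".
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())
```

Counting tags is only a first look at a file; for actual browsing of the trees and discourse arrows, the TrEd editor described further below is the intended tool.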

PDiT 4.0 in the PDTB 3.0 data format

From the original PML format, the PDiT data have been automatically transformed to the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al., 2019) format [get it here: http://hdl.handle.net/11234/1-5680].

The Penn Discourse Treebank column format is a stand-off annotation format. Each document is represented by two files: the first contains the raw text of the document (one sentence per line, paragraph boundaries marked by empty lines), the second carries the annotation with links to the raw text. Each discourse relation is represented by a single line consisting of a list of fields separated by |. The fields are described in the Appendix below.
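The raw-text convention described above (one sentence per line, empty lines as paragraph boundaries) can be read with a few lines of code. This is a minimal sketch; the function name is hypothetical, and lines are kept unmodified because the annotation refers back to positions in the raw text.

```python
def read_paragraphs(text):
    """Split raw text into paragraphs, each a list of sentence lines.

    Follows the convention stated above: one sentence per line,
    an empty line marks a paragraph boundary.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            # Keep the sentence exactly as it appears in the raw file.
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs
```

For example, `read_paragraphs("A.\nB.\n\nC.\n")` yields two paragraphs, the first with two sentences and the second with one.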

To conform to the strict format requirements of the PDTB annotation tool Annotator (Lee et al., 2016), the data for each subcorpus are placed in two subdirectories, gold and raw, carrying the annotations and the original plain texts, respectively. In both directories, the data are further divided into numbered subdirectories; e.g., for the PDT part, ten subdirectories (01 ... 10) correspond to the original division into train-1 ... train-8, dtest, etest.
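The gold/raw layout can be traversed as sketched below. Note the assumption, which the description above does not confirm: an annotation file and its raw text are taken to sit in the same numbered subdirectory under the same file name. The function name is hypothetical.

```python
from pathlib import Path


def paired_documents(subcorpus_dir):
    """Yield (gold_path, raw_path) pairs for one subcorpus directory.

    Assumption (not confirmed by the documentation): a gold file and its
    raw text share the same numbered subdirectory and the same file name.
    """
    root = Path(subcorpus_dir)
    gold_root, raw_root = root / "gold", root / "raw"
    for gold_path in sorted(gold_root.glob("*/*")):
        if not gold_path.is_file():
            continue
        raw_path = raw_root / gold_path.parent.name / gold_path.name
        # Skip annotation files without a matching raw text.
        if raw_path.is_file():
            yield gold_path, raw_path
```

If the actual file names differ between the two directories, only the line that builds `raw_path` needs to change.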

Data in numbers

Please note that some of the PDiT relations could not be represented in the PDTB format, e.g. relations in incomplete sentences such as "Kdybych věděl dříve, že je ti dobře" ["If I knew sooner that you are well"] (Faust subcorpus), where the missing continuation has a representation on the tectogrammatical layer but not in the plain text. Therefore the total numbers of PDiT discourse relations in the PDTB format are slightly lower than in the original PML format.

Subcorpora sizes and numbers of discourse relations in individual PDiT subcorpora
subcorpus	corpus size (documents)	corpus size (sentences)**	corpus size (tokens)	discourse relations in PML format	discourse relations in PDTB format
PDT	3,165	49,419	833,195	21,611 (+ 443 list rels*)	21,537
PCEDT-cz	2,312	49,208	1,152,289	28,967	28,940
PDTSC	1,553	73,802	742,316	31,218	31,074
Faust	60	3,000	33,836	710	705
TOTAL	7,090	175,429	2,761,636	82,506 (+ 443 list rels*)	82,256

* list rels are relations connecting enumerated items (e.g. 1), 2)); as such, they represent a different type of discourse phenomenon than semantic discourse relations

** In the PML version of the PDT and PDTSC data, there are 9 and 33 empty trees, respectively. These empty trees are not included in the numbers of sentences in the corpora. (Including the empty trees, the numbers of sentences in the PDT and the PDTSC would be 49,428 and 73,835, respectively. The total number of sentences in PDT-C (i.e., in the PML version of PDiT) would then be 175,471.)
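The figures in the table can be cross-checked with simple arithmetic. The sketch below copies the numbers from the table (list relations left aside), sums the columns, and computes the per-subcorpus difference between the two relation columns, i.e. the relations present in PML but absent from the PDTB version. The variable names are illustrative only.

```python
# Figures copied from the table above:
# (documents, sentences, tokens, relations in PML, relations in PDTB)
SUBCORPORA = {
    "PDT": (3165, 49419, 833195, 21611, 21537),
    "PCEDT-cz": (2312, 49208, 1152289, 28967, 28940),
    "PDTSC": (1553, 73802, 742316, 31218, 31074),
    "Faust": (60, 3000, 33836, 710, 705),
}
TOTAL = (7090, 175429, 2761636, 82506, 82256)

# Column-wise sums over the four subcorpora; they match the TOTAL row.
column_sums = tuple(sum(col) for col in zip(*SUBCORPORA.values()))

# Relations in the PML format that have no counterpart in the PDTB format.
not_converted = {name: row[3] - row[4] for name, row in SUBCORPORA.items()}
# -> {'PDT': 74, 'PCEDT-cz': 27, 'PDTSC': 144, 'Faust': 5}, 250 in total

# Sentence count including the 9 (PDT) and 33 (PDTSC) empty trees.
sentences_with_empty_trees = TOTAL[1] + 9 + 33  # -> 175471
```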

How to browse the data

Data in the PML format

The tree editor TrEd (Pajas and Štěpánek, 2008) can be used to open and browse the data in the PML format. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions given on that page for your operating system.

After the installation, an extension needs to be installed:

Start TrEd.
In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
Make sure that at least the extension "Prague Dependency Treebank - Consolidated 1.0 (pdtc10)" is checked to install (if it is not in the list, it may have already been installed).
Click on the button "Install Selected"; the selected extensions (along with their dependencies) get installed.
Close all TrEd windows including the main application window and start TrEd again.

To see the discourse annotation of a document on the tectogrammatical layer, open the respective file with extension .t.gz. By default, orange discourse arrows are displayed without any additional info. Press 'd' to see more discourse-related information.

If you have trouble installing TrEd or browsing the data, please contact the authors at tred at ufal.mff.cuni.cz.

Data in the PDTB format

Please find a description of the fields of the PDTB stand-off format in the Appendix.

The data are compatible with the Penn Discourse Treebank annotation tool Annotator (Lee et al., 2016).

Appendix

The column format used in PDiT 4.0 consists of 44 fields. Fields 0-33 correspond to fields defined in the PDTB 3.0, and their description is taken from the PDTB 3.0 annotation manual (and slightly adapted for PDiT 4.0); fields 34-43 carry additional information added only in PDiT 4.0. Some of the fields are not used in PDiT 4.0 but are kept for compatibility with the PDTB 3.0 format (they are marked as 'not used'):

0 - Relation Type - Explicit, AltLex, AltLexC
1 - Conn SpanList - SpanList of the Explicit Connective or the AltLex/AltLexC selection
2 - Conn Src - Connective’s Source (not used)
3 - Conn Type - Connective’s Type (not used)
4 - Conn Pol - Connective’s Polarity (not used)
5 - Conn Det - Connective’s Determinacy (not used)
6 - Conn Feat SpanList - Connective’s Feature SpanList (not used)
7 - Conn1 - Explicit Connective Head
8 - SClass1A - Semantic Class of the Connective
9 - SClass1B - Second Semantic Class of the First Connective (not used)
10 - Conn2 - Second Implicit Connective (not used)
11 - SClass2A - First Semantic Class of the Second Connective (not used)
12 - SClass2B - Second Semantic Class of the Second Connective (not used)
13 - Sup1 SpanList - SpanList of the First Argument’s Supplement
14 - Arg1 SpanList - SpanList of the First Argument
15 - Arg1 Src - First Argument’s Source (not used)
16 - Arg1 Type - First Argument’s Type (not used)
17 - Arg1 Pol - First Argument’s Polarity (not used)
18 - Arg1 Det - First Argument’s Determinacy (not used)
19 - Arg1 Feat SpanList - SpanList of the First Argument’s Feature (not used)
20 - Arg2 SpanList - SpanList of the Second Argument
21 - Arg2 Src - Second Argument’s Source (not used)
22 - Arg2 Type - Second Argument’s Type (not used)
23 - Arg2 Pol - Second Argument’s Polarity (not used)
24 - Arg2 Det - Second Argument’s Determinacy (not used)
25 - Arg2 Feat SpanList - SpanList of the Second Argument’s Feature (not used)
26 - Sup2 SpanList - SpanList of the Second Argument’s Supplement
27 - Adju Reason - The Adjudication Reason (not used)
28 - Adju Disagr - The type of the Adjudication disagreement (not used)
29 - PB Role - The PropBank role of the PropBank verb (not used)
30 - PB Verb - The PropBank verb of the main clause of this relation (not used)
31 - Offset - The Conn SpanList of Explicit/AltLex/AltLexC tokens
32 - Provenance - Indicates whether the token is a new PDTB3 token or has a corresponding PDTB2 token (not used)
33 - Link - The link id of the token
34 - Discourse Type - The original discourse type in the Prague taxonomy
35 - Conn Text - Text representation of field 31 (Offset)
36 - Conn Feat Text - Text representation of field 6 (Conn Feat SpanList) (not used)
37 - Sup1 Text - Text representation of field 13 (Sup1 SpanList)
38 - Arg1 Text - Text representation of field 14 (Arg1 SpanList)
39 - Arg1 Feat Text - Text representation of field 19 (Arg1 Feat SpanList) (not used)
40 - Arg2 Text - Text representation of field 20 (Arg2 SpanList)
41 - Arg2 Feat Text - Text representation of field 25 (Arg2 Feat SpanList) (not used)
42 - Sup2 Text - Text representation of field 26 (Sup2 SpanList)
43 - Genre - The genre of the document
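
The field list above can be turned into a small parser for one annotation line. This is a minimal sketch rather than an official reader: the function names are hypothetical, the field names are taken from the list above, and the code assumes that no field value contains a literal | character.

```python
# Field names in the order given in the Appendix (indices 0-43).
FIELD_NAMES = [
    "Relation Type", "Conn SpanList", "Conn Src", "Conn Type", "Conn Pol",
    "Conn Det", "Conn Feat SpanList", "Conn1", "SClass1A", "SClass1B",
    "Conn2", "SClass2A", "SClass2B", "Sup1 SpanList", "Arg1 SpanList",
    "Arg1 Src", "Arg1 Type", "Arg1 Pol", "Arg1 Det", "Arg1 Feat SpanList",
    "Arg2 SpanList", "Arg2 Src", "Arg2 Type", "Arg2 Pol", "Arg2 Det",
    "Arg2 Feat SpanList", "Sup2 SpanList", "Adju Reason", "Adju Disagr",
    "PB Role", "PB Verb", "Offset", "Provenance", "Link", "Discourse Type",
    "Conn Text", "Conn Feat Text", "Sup1 Text", "Arg1 Text",
    "Arg1 Feat Text", "Arg2 Text", "Arg2 Feat Text", "Sup2 Text", "Genre",
]

# Indices marked 'not used' in the list above.
NOT_USED = {2, 3, 4, 5, 6, 9, 10, 11, 12, 15, 16, 17, 18, 19,
            21, 22, 23, 24, 25, 27, 28, 29, 30, 32, 36, 39, 41}


def parse_relation(line):
    """Split one annotation line into a dict keyed by field name.

    Assumes that field values never contain a literal '|'.
    """
    values = line.rstrip("\r\n").split("|")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


def used_fields(relation):
    """Keep only the fields that PDiT 4.0 actually fills in."""
    return {name: relation[name]
            for index, name in enumerate(FIELD_NAMES)
            if index not in NOT_USED}
```

With this mapping, `used_fields(parse_relation(line))` returns the 17 fields that carry information in PDiT 4.0, including the text representations in fields 35-42.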

References

Hajič, J. et al.: Prague Dependency Treebank - Consolidated 2.0 (PDT-C 2.0). Data/software, LINDAT-CLARIAH, URL: http://hdl.handle.net/11234/1-5813, 2024.

Lee, A., Prasad, R., Webber, B. and Joshi, A.: Annotating discourse relations with the PDTB Annotator. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 121-125, 2016.

Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.

Prasad, R., Webber, B., Lee, A. and Joshi, A.: Penn Discourse Treebank Version 3.0. Data/Software, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, LDC2019T05, 2019.
