SumeCzech is a 1-million-document dataset of Czech news, each consisting of:
For more details, please read our paper "SumeCzech: Large Czech News-Based Summarization Dataset".
We distribute only the scripts that download the dataset from CommonCrawl, not the data itself. You can download the scripts from the LINDAT/CLARIAH-CZ repository.
Here we collect the published results of summarization methods on the SumeCzech dataset, using the published ROUGERAW metric. Note that the results differ from the original results reported in the paper, because the published metric uses slightly different tokenization.
Paper | System | Test ROUGERAW-1 P | Test ROUGERAW-1 R | Test ROUGERAW-1 F | Test ROUGERAW-2 P | Test ROUGERAW-2 R | Test ROUGERAW-2 F | Test ROUGERAW-L P | Test ROUGERAW-L R | Test ROUGERAW-L F | Out-of-domain ROUGERAW-1 P | Out-of-domain ROUGERAW-1 R | Out-of-domain ROUGERAW-1 F | Out-of-domain ROUGERAW-2 P | Out-of-domain ROUGERAW-2 R | Out-of-domain ROUGERAW-2 F | Out-of-domain ROUGERAW-L P | Out-of-domain ROUGERAW-L R | Out-of-domain ROUGERAW-L F
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SumeCzech | first | 13.9 | 23.6 | 16.5 | 04.1 | 07.4 | 05.0 | 12.2 | 20.7 | 14.5 | 13.3 | 26.5 | 16.7 | 04.7 | 10.0 | 06.0 | 11.6 | 23.3 | 14.7
SumeCzech | random | 11.0 | 17.8 | 12.8 | 02.6 | 04.5 | 03.1 | 09.6 | 15.5 | 11.1 | 10.6 | 20.7 | 13.1 | 03.2 | 06.9 | 04.1 | 09.3 | 18.2 | 11.5
SumeCzech | textrank | 13.3 | 22.8 | 15.9 | 03.7 | 06.8 | 04.6 | 11.6 | 19.9 | 13.8 | 12.8 | 25.9 | 16.3 | 04.5 | 09.6 | 05.7 | 11.3 | 22.7 | 14.2
SumeCzech | tensor2tensor | 20.2 | 15.9 | 17.2 | 06.7 | 05.1 | 05.6 | 18.6 | 14.7 | 15.8 | 19.4 | 15.1 | 16.3 | 07.1 | 05.2 | 05.7 | 18.1 | 14.1 | 15.2

Paper | System | Test ROUGERAW-1 P | Test ROUGERAW-1 R | Test ROUGERAW-1 F | Test ROUGERAW-2 P | Test ROUGERAW-2 R | Test ROUGERAW-2 F | Test ROUGERAW-L P | Test ROUGERAW-L R | Test ROUGERAW-L F | Out-of-domain ROUGERAW-1 P | Out-of-domain ROUGERAW-1 R | Out-of-domain ROUGERAW-1 F | Out-of-domain ROUGERAW-2 P | Out-of-domain ROUGERAW-2 R | Out-of-domain ROUGERAW-2 F | Out-of-domain ROUGERAW-L P | Out-of-domain ROUGERAW-L R | Out-of-domain ROUGERAW-L F
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SumeCzech | first | 07.4 | 13.5 | 08.9 | 01.1 | 02.2 | 01.3 | 06.5 | 11.7 | 07.7 | 06.7 | 13.6 | 08.3 | 01.3 | 02.8 | 01.6 | 05.9 | 12.0 | 07.4
SumeCzech | random | 05.9 | 10.3 | 06.9 | 00.5 | 01.0 | 00.6 | 05.2 | 08.9 | 06.0 | 05.2 | 10.0 | 06.3 | 00.6 | 01.4 | 00.8 | 04.6 | 08.9 | 05.6
SumeCzech | textrank | 06.0 | 16.5 | 08.3 | 00.8 | 02.3 | 01.1 | 05.0 | 13.8 | 06.9 | 05.8 | 16.9 | 08.1 | 01.1 | 03.4 | 01.5 | 05.0 | 14.5 | 06.9
SumeCzech | tensor2tensor | 08.8 | 07.0 | 07.5 | 00.8 | 00.6 | 00.7 | 08.1 | 06.5 | 07.0 | 06.3 | 05.1 | 05.5 | 00.5 | 00.4 | 00.4 | 05.9 | 04.8 | 05.1
Bachelor Thesis of Müller, 2020 | Seq2seq-FT | 15.4 | 13.7 | 14.1 | 02.4 | 02.1 | 02.1 | 13.9 | 12.4 | 12.8 | 12.6 | 11.4 | 11.6 | 01.9 | 01.6 | 01.7 | 11.7 | 10.7 | 10.8
Bachelor Thesis of Müller, 2020 | Seq2seq-FT-NER | 15.3 | 13.6 | 14.0 | 02.4 | 02.0 | 02.1 | 13.9 | 12.4 | 12.7 | 13.0 | 11.6 | 11.9 | 01.9 | 01.7 | 01.7 | 12.0 | 10.8 | 11.0

Paper | System | Test ROUGERAW-1 P | Test ROUGERAW-1 R | Test ROUGERAW-1 F | Test ROUGERAW-2 P | Test ROUGERAW-2 R | Test ROUGERAW-2 F | Test ROUGERAW-L P | Test ROUGERAW-L R | Test ROUGERAW-L F | Out-of-domain ROUGERAW-1 P | Out-of-domain ROUGERAW-1 R | Out-of-domain ROUGERAW-1 F | Out-of-domain ROUGERAW-2 P | Out-of-domain ROUGERAW-2 R | Out-of-domain ROUGERAW-2 F | Out-of-domain ROUGERAW-L P | Out-of-domain ROUGERAW-L R | Out-of-domain ROUGERAW-L F
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SumeCzech | first | 13.1 | 17.9 | 14.4 | 01.9 | 02.8 | 02.1 | 08.8 | 12.0 | 09.6 | 11.1 | 17.1 | 12.7 | 01.6 | 02.7 | 01.9 | 07.6 | 11.7 | 08.7
SumeCzech | random | 11.7 | 15.5 | 12.7 | 01.2 | 01.7 | 01.3 | 07.7 | 10.3 | 08.4 | 10.1 | 15.1 | 11.4 | 01.0 | 01.7 | 01.2 | 06.9 | 10.3 | 07.8
SumeCzech | textrank | 11.1 | 20.8 | 13.8 | 01.6 | 03.1 | 02.0 | 07.1 | 13.4 | 08.9 | 09.8 | 19.9 | 12.5 | 01.5 | 03.3 | 02.0 | 06.6 | 13.3 | 08.4
SumeCzech | tensor2tensor | 13.2 | 10.5 | 11.3 | 01.2 | 00.9 | 01.0 | 10.2 | 08.1 | 08.7 | 12.5 | 09.4 | 10.3 | 00.8 | 00.6 | 00.6 | 09.8 | 07.5 | 08.1
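
To make the P, R and F columns in the tables above concrete, here is a minimal sketch of how precision, recall and F-score can be computed from n-gram overlap (for the -1 and -2 variants) and from the longest common subsequence (for the -L variant). This is not the published ROUGERAW implementation: the function names are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption that will not reproduce the official numbers, since, as noted above, the scores depend on the tokenization used.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. The published metric
    # uses its own tokenization, so scores will differ.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, system_total, gold_total):
    # Precision is relative to the system output, recall to the gold summary.
    p = overlap / system_total if system_total else 0.0
    r = overlap / gold_total if gold_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(gold, system, n):
    gold_ngrams = ngram_counts(tokenize(gold), n)
    system_ngrams = ngram_counts(tokenize(system), n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((gold_ngrams & system_ngrams).values())
    return prf(overlap, sum(system_ngrams.values()), sum(gold_ngrams.values()))


def lcs_length(a, b):
    # Classic dynamic program for the longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(gold, system):
    g, s = tokenize(gold), tokenize(system)
    return prf(lcs_length(g, s), len(s), len(g))
```

For example, `rouge_n("the cat sat on the mat", "the cat sat", 1)` gives a precision of 1.0 (every system token appears in the gold summary) and a recall of 0.5 (half of the gold tokens are covered). The tables report these values multiplied by 100.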
@inproceedings{straka-etal-2018-sumeczech,
  title = "{S}ume{C}zech: Large {C}zech News-Based Summarization Dataset",
  author = "Straka, Milan and Mediankin, Nikita and Kocmi, Tom and {\v{Z}}abokrtsk{\'y}, Zden{\v{e}}k and Hude{\v{c}}ek, Vojt{\v{e}}ch and Haji{\v{c}}, Jan",
  booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
  month = may,
  year = "2018",
  address = "Miyazaki, Japan",
  publisher = "European Language Resources Association (ELRA)",
  url = "https://www.aclweb.org/anthology/L18-1551",
}