UFAL Medical Corpus is a collection of parallel corpora assembled during the KConnect, Khresmoi and HimL projects, which aim at more reliable machine translation of medical texts. The collected corpora are described in HimL project Deliverable D1.1, Report on Building Translation Systems for Public Health Domain, and in KConnect project Deliverable D1.2, Toolkit and Report for Translator Adaptation to New Languages (see the list of project results for each project).
UFAL Medical Corpus covers the following languages: Czech, German, Spanish, French, Hungarian, Polish, Romanian and Swedish. Each language is paired with English.
UFAL Medical Corpus v.1.0 also serves as the training data for the WMT17 Biomedical Task. For this purpose, we restricted the set of sentences somewhat for copyright reasons.
We have combined data from various in-domain and out-of-domain sources into a single corpus. Duplicate sentences were excluded and the resulting corpus was shuffled.
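The deduplicate-and-shuffle step can be illustrated with a minimal Python sketch. It assumes that duplicates are exact line matches, which is consistent with the 'sort | uniq' counts reported in the tables; the function name dedup_and_shuffle and the seed parameter are hypothetical, not part of the original pipeline.

```python
import random


def dedup_and_shuffle(lines, seed=0):
    """Drop exact duplicate lines, then shuffle the remaining ones."""
    # dict.fromkeys keeps the first occurrence of each distinct line.
    unique = list(dict.fromkeys(lines))
    # A seeded generator makes the shuffle reproducible (an assumption;
    # the original release does not say how the shuffle was done).
    random.Random(seed).shuffle(unique)
    return unique
```

For corpora of the sizes listed in the tables, an in-memory approach like this may need a lot of RAM; an external sort would be the usual alternative.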
UFAL Medical Corpus has the following format: source_sentence [tab]
For dictionaries, source_sentence and target_sentence are dictionary entries.
Type_of_data can have the following values: medical_corpus, general_corpus, medical_dictionary, general_dictionary.
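A tab-separated line of this kind might be read as in the following sketch. The exact column order is not fully specified here, so the code assumes that source_sentence and target_sentence are the first two fields, and it searches the remaining fields for one of the four type_of_data values instead of relying on a fixed position. The names Entry, parse_line and is_medical are hypothetical helpers.

```python
from typing import NamedTuple, Optional

# The four values listed above for the type_of_data field.
DATA_TYPES = {
    "medical_corpus",
    "general_corpus",
    "medical_dictionary",
    "general_dictionary",
}


class Entry(NamedTuple):
    source: str
    target: str
    data_type: Optional[str]


def parse_line(line):
    """Split one corpus line into source, target and type_of_data."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated fields")
    # Assumption: the first two fields are the source and target sentences.
    source, target = fields[0], fields[1]
    # Find type_of_data by value, so extra columns do not break parsing.
    data_type = next((f for f in fields[2:] if f in DATA_TYPES), None)
    return Entry(source, target, data_type)


def is_medical(entry):
    """True for entries from medical corpora or medical dictionaries."""
    return entry.data_type in ("medical_corpus", "medical_dictionary")
```

Filtering with is_medical would keep only in-domain material, for example when building an in-domain development set.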
The following table summarizes medical-domain corpora included in the UFAL Medical Corpus collection:
Corpora | cs-en | de-en | es-en | fr-en | hu-en | pl-en | ro-en | sv-en |
CESTA | - | - | - | 3,617 | - | - | - | - |
ECDC | 2,324 | 2,379 | 2,357 | 2,377 | 2,306 | 2,202 | 2,363 | 2,345 |
EMEA (OpenSubtitles) | 445,365 | 481,443 | 487,901 | 493,933 | 462,541 | 459,225 | 424,904 | 466,108 |
EMEA (new crawl) | 687,635 | 615,256 | - | - | - | 652,336 | 621,490 | - |
Medical Web Crawl | - | - | 148,982 | - | - | - | - | - |
Medical Web Texts from CzEng 1.6 | 7,029 | - | - | - | - | - | - | - |
MuchMore | - | 28,919 | - | - | - | - | - | - |
PatTR Medical | - | 1,830,647 | - | 2,191,537 | - | - | - | - |
Subtitles | 3,140 | 77,937 | 151,675 | 120,841 | - | 3,010 | 116,335 | 96,575 |
Total Parallel Segments | 1,145,493 | 3,036,581 | 790,915 | 2,812,305 | 464,847 | 1,116,773 | 1,165,092 | 565,028 |
Total Parallel Segments (after 'sort | uniq') | 819,697 | 2,662,810 | 631,087 | 2,634,229 | 351,336 | 800,662 | 852,800 | 444,777 |
Total Words (target language/en) | 14M/15M | 84M/94M | 9M/10M | 89M/100M | 5M/5M | 14M/14M | 14M/15M | 6M/5M |
We also included general domain data in the release. The following table summarizes the general purpose corpora included in the UFAL Medical Corpus collection:
Corpora | cs-en | de-en | es-en | fr-en | hu-en | pl-en | ro-en | sv-en |
Cordis | - | - | - | - | - | 168,067 | - | - |
EUbookshop | 428,339 | 9,011,774 | 5,103,274 | 10,225,247 | 412,618 | 509,105 | 310,653 | 1,877,976 |
EUROPARL | 643,361 | 1,918,724 | 1,964,134 | 2,006,305 | 621,328 | 627,367 | 396,882 | 1,852,450 |
Hunglish | - | - | - | - | 2,083,159 | - | - | - |
JRC-Acquis | 1,113,649 | 642,797 | 720,201 | 720,747 | 449,361 | 1,412,095 | 428,618 | 708,759 |
MultiUN | - | 153,545 | 7,734,469 | 11,840,859 | - | - | - | - |
News Commentary | 146,135 | 200,534 | 193,665 | 182,645 | - | - | - | - |
OpenSubtitles | 44,618,012 | 12,815,341 | 75,947,825 | 49,035,989 | 44,612,969 | 34,926,913 | 59,732,934 | 17,840,535 |
PatTR Other | - | 9,302,172 | - | 10,957,584 | - | - | - | - |
Rapid | - | - | - | - | - | 132,156 | - | - |
Total Parallel Segments | 46,949,496 | 34,034,887 | 91,663,568 | 84,969,376 | 48,179,435 | 37,775,703 | 60,869,087 | 22,279,720 |
Total Parallel Segments (after 'sort | uniq') | 38,065,775 | 31,638,916 | 75,421,729 | 74,045,053 | 39,499,594 | 31,786,926 | 47,829,602 | 19,447,606 |
Total Words (target language/en) | 276M/333M | 716M/817M | 874M/889M | 1,392M/1,490M | 340M/262M | 288M/229M | 402M/377M | 221M/195M |
Dictionaries | cs-en | de-en | es-en | fr-en | hu-en | pl-en | ro-en | sv-en |
DBpedia | 148,181 | 681,494 | 544,686 | 44,977 | 139,329 | 549,600 | - | 297,913 |
Linguee | - | 51,571 | - | - | - | - | - | - |
To download UFAL Medical Corpus v.1.0, you have to register by filling in the registration form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.
After registration, you will receive a unique username. This username and the shared password "ufalmedi" will be requested at the following link:
Tip for the Linux wget tool: use the flags --user=YOUR-USERNAME --password=ufalmedi to pass the authorization check, and the flag --continue to resume an interrupted transfer.
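The same idea can be sketched in Python for users without wget. This is a minimal sketch, assuming the server uses HTTP Basic authentication and honours Range requests; neither is confirmed here. The function build_request and its partial_path parameter are hypothetical, and the URL must be replaced by the actual download link.

```python
import base64
import os
import urllib.request


def build_request(url, username, password, partial_path=None):
    """Build an authenticated request that resumes a partial download."""
    # HTTP Basic credentials are "user:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    # If part of the file is already on disk, ask only for the rest,
    # which is roughly what wget --continue does.
    if partial_path and os.path.exists(partial_path):
        offset = os.path.getsize(partial_path)
        if offset > 0:
            request.add_header("Range", f"bytes={offset}-")
    return request
```

The returned object can be passed to urllib.request.urlopen, and the response body appended to the partial file opened in "ab" mode. A server that ignores the Range header replies with status 200 and the whole file, in which case the file should be overwritten instead of appended to.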
WARNING: Due to a processing error, corpus entries extracted from Medical_Web_Texts_from_CzEng1.6 contain an additional column with a score, resulting in the following line format:
source_sentence [tab]
We gratefully acknowledge support from: