Czech-English Manual Word Alignment

Author: David Mareček

License: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Download the data from the LINDAT repository: http://hdl.handle.net/11234/1-1804

This corpus of manually aligned Czech-English parallel sentences was created in 2008 to test the alignment quality of automatic word-alignment tools. The data were revised and uploaded to LINDAT in 2016. The corpus comprises 2500 parallel sentences from 7 different sources.

dirname	source	#chunks	#sentences	#ENtokens	#CStokens	#alltokens
celex	Acquis Communautaire	10	501	13,512	10,752	24,264
rd	Reader’s Digest	7	350	6,294	5,792	12,086
project_syndicate	Project Syndicate	10	484	10,714	9,990	20,704
kacenka	Kačenka	2	100	3,006	2,553	5,559
books	E-Books	1	50	797	633	1,430
named_entities	Project Syndicate with NE	168	500	12,799	11,052	23,851
pcedt	PCEDT	22	515	12,697	12,174	24,871
	total	190	2500	59,819	52,946	112,765

Descriptions of and links to the individual data sources, the annotation guidelines, and a description of the annotation procedure are in Chapter 4 of David Mareček's diploma thesis, which is included in the package.

Manually aligned data are in the directory 'data'.
Automatically merged alignments from two different annotators are in the directory 'merged_data'.
You can browse the manual alignments with the ALPACO program.

Usage: perl tools/alpaco.pl

Annotators (file extensions in brackets): Ondřej Bojar (.o.wa), Magdalena Prokopová (.p.wa), Martin Popel (.m.wa), Zuzana Škardová (.z.wa), Markus Giger (.g.wa), Jiří Januška (.j.wa)
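
The extension list above can be used to tell which annotator produced a given alignment file. The following is only a minimal sketch in Python, assuming that files under 'data' end with exactly these extensions; the helper names annotator_for and group_by_annotator and the example file name are hypothetical and not part of the package.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Extension-to-annotator mapping, copied from the list above.
ANNOTATORS = {
    ".o.wa": "Ondřej Bojar",
    ".p.wa": "Magdalena Prokopová",
    ".m.wa": "Martin Popel",
    ".z.wa": "Zuzana Škardová",
    ".g.wa": "Markus Giger",
    ".j.wa": "Jiří Januška",
}


def annotator_for(path: str) -> Optional[str]:
    """Return the annotator for a .wa file, or None if the extension is unknown."""
    name = Path(path).name
    for suffix, annotator in ANNOTATORS.items():
        if name.endswith(suffix):
            return annotator
    return None


def group_by_annotator(root: str) -> Dict[str, List[Path]]:
    """Collect all recognised .wa files below root, grouped by annotator name."""
    groups: Dict[str, List[Path]] = {}
    for path in sorted(Path(root).rglob("*.wa")):
        who = annotator_for(path.name)
        if who is not None:
            groups.setdefault(who, []).append(path)
    return groups


if __name__ == "__main__":
    # "data" is the directory named in this README; adjust if unpacked elsewhere.
    for who, files in group_by_annotator("data").items():
        print(f"{who}: {len(files)} file(s)")
```

Files whose names do not end in one of the six listed extensions are skipped, so the sketch makes no assumption about how files in 'merged_data' are named.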

How to cite

If you make use of the CzEnAli corpus, please cite my thesis:

@mastersthesis{ marecek:2008,
  title = {Automatic Alignment of Tectogrammatical Trees from Czech-English Parallel Corpus},
  author = {David Mare{\v{c}}ek},
  year = {2008},
  school = {Charles University},
  address = {{MFF} {UK}},
}
