VPS-30-En | ÚFAL

VPS-30-En (Verb Pattern Sample, 30 English verbs) is a newly developed lexical resource.
It contains descriptions of the following 30 English verbs:

Verb Entries				Lexicographer-revised Annotation			Adjudicated Multiple Annotation
#	verbs	patterns	pattern definitions	original	trial	adjudicated	annotators	samples	IAA	confusion matrices
1	access	8	html xml	orig-250	trial1	adj	4	html xls csv	0.600	txt
2	ally	6	html xml	orig-250	-	adj	4	html xls csv	0.710	txt
3	arrive	6	html xml	orig-250	-	adj	4	html xls csv	0.806	txt
4	breathe	17	html xml	orig-250	trial1 trial2	adj	4	html xls csv	0.793	txt
5	claim	9	html xml	orig-500	-	adj	4	html xls csv	0.764	txt
6	cool	14	html xml	orig-250	trial1	adj	4	html xls csv	0.843	txt
7	crush	19	html xml	orig-250	trial1 trial2	adj	4	html xls csv	0.549	txt
8	cry	18	html xml	orig-250	-	adj	4	html xls csv	0.754	txt
9	deny	10	html xml	orig-250	trial1	adj	4	html xls csv	0.651	txt
10	enlarge	4	html xml	orig-250	trial1	adj	4	html xls csv	0.536	txt
11	enlist	5	html xml	orig-250	trial1	adj	4	html xls csv	0.693	txt
12	forge	12	html xml	orig-250	trial1 trial2	adj	4	html xls csv	0.594	txt
13	furnish	7	html xml	orig-250	trial1	adj	4	html xls csv	0.773	txt
14	hail	9	html xml	orig-250	trial1	adj	4	html xls csv	0.727	txt
15	halt	3	html xml	orig-250	-	adj	4	html xls csv	0.540	txt
16	part	11	html xml	orig-250	trial1	adj	4	html xls csv	0.791	txt
17	plough	17	html xml	orig-250	-	adj	4	html xls csv	0.820	txt
18	plug	12	html xml	orig-250	trial1	adj	4	html xls csv	0.607	txt
19	pour	21	html xml	orig-250	trial1	adj	4	html xls csv	0.652	txt
20	say	14	html xml	orig-500	-	adj	4	html xls csv	0.798	txt
21	smash	10	html xml	orig-250	trial1	adj	4	html xls csv	0.657	txt
22	smell	9	html xml	orig-250	trial1	adj	4	html xls csv	0.746	txt
23	steer	22	html xml	orig-250	trial1	adj	4	html xls csv	0.572	txt
24	submit	5	html xml	orig-250	-	adj	4	html xls csv	0.764	txt
25	swell	23	html xml	orig-250	trial1	adj	4	html xls csv	0.765	txt
26	tell	19	html xml	orig-500	-	adj	4	html xls csv	0.715	txt
27	throw	72	html xml	orig-1000	-	adj	4	html xls csv	0.524	txt
28	trouble	13	html xml	orig-250	trial1	adj	4	html xls csv	0.693	txt
29	wake	10	html xml	orig-250	trial1	adj	4	html xls csv	0.717	txt
30	yield	11	html xml	orig-250	trial1	adj	4	html xls csv	0.716	txt

If you would like to receive all the data as a single package, please e-mail <krejcova(at)ufal.mff.cuni.cz>.

Data overview

All pattern definitions (lexicon entries) and all annotations we have produced are listed in the table, which links directly to the actual files in several formats.

The table is divided into three sections:

Verb Entries
Lexicographer-revised Annotation
Adjudicated Multiple Annotation

The first section, Verb Entries, lists the verbs in alphabetical order, the number of patterns each verb has in the Validation Database, and the pattern definitions. The entries are provided in html (preview only) and in xml.

The following two sections contain annotated corpus concordances. The Lexicographer-revised Annotation section contains:

the original (reference) sample created by the lexicographer alone while compiling the entry (the original column)
the random sample(s) that the annotators annotated according to the definitions and the original sample (the trial and adjudicated columns).

The adjudicated column contains the adjudicated results of the last multiple annotation round. “Adjudicated” means that the lexicographer considered all values suggested by the annotators and selected the best one, which is the only value kept in the file. The files in this column therefore constitute a single-value annotation based on the feedback from a multiple annotation. Adjudication was performed only for a round in which the interannotator agreement was reasonably good and the manual disagreement analysis revealed no obvious need to correct the verb entry.

Whenever the outcome of the multiple annotation was not good enough and the entry needed a revision, the multiple annotation was discarded. The lexicographer revised the entry and updated the reference sample (“original”) to match the revised pattern definitions. The same was done to the sample whose multiple annotation round had triggered the entry revision. Each annotation round that resulted in an entry revision produced one such sample. These samples are called trial. Usually, there are one or two per verb.

The last section, Adjudicated Multiple Annotation, contains the final annotation round with its multiple values: the round that was declared satisfactory and after which no entry revisions followed. The lexicographer checked all the annotations and deleted evident errors. The record of the disagreement analysis is stored in the adjudication tables in the samples column. It is available as html (a preview only, without concordance ID numbers), xls (the original file, with errors marked in red) and csv (without colors; the erroneous values are left out of the list of acceptable values). All files except the html preview also contain the BNC-native sentence IDs.

The Adjudicated Multiple Annotation section also gives complementary information for the last annotation round: the number of annotators (the annotators column), the interannotator agreement (the IAA column), and the confusion matrices (the confusion matrices column).
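For readers who want to work with the overview table programmatically, the following is a minimal sketch. It assumes the table has been saved as tab-separated text with the column names shown above; the page itself does not offer such a file, and the function name load_rows is only illustrative. The three data rows are copied from the table.

```python
import csv
import io

# Header and three rows copied from the overview table, tab-separated.
# Saving the table in this form is an assumption made for this sketch.
TABLE = (
    "#\tverbs\tpatterns\tpattern definitions\toriginal\ttrial\tadjudicated\t"
    "annotators\tsamples\tIAA\tconfusion matrices\n"
    "3\tarrive\t6\thtml xml\torig-250\t-\tadj\t4\thtml xls csv\t0.806\ttxt\n"
    "7\tcrush\t19\thtml xml\torig-250\ttrial1 trial2\tadj\t4\thtml xls csv\t0.549\ttxt\n"
    "27\tthrow\t72\thtml xml\torig-1000\t-\tadj\t4\thtml xls csv\t0.524\ttxt\n"
)


def load_rows(text):
    """Parse the tab-separated table into a list of small dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        trial_cell = record["trial"]
        rows.append({
            "verb": record["verbs"],
            "patterns": int(record["patterns"]),
            # "-" in the trial column is read here as "no trial sample".
            "trials": 0 if trial_cell == "-" else len(trial_cell.split()),
            "iaa": float(record["IAA"]),
        })
    return rows


rows = load_rows(TABLE)
# List the verbs from lowest to highest interannotator agreement.
for row in sorted(rows, key=lambda r: r["iaa"]):
    print(row["verb"], row["patterns"], row["trials"], row["iaa"])
```

Sorting by the IAA column in this way makes it easy to see which entries were hardest for the annotators to apply consistently.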

Usual sample size

The sample that is released along with the entry (“original”) is usually 250 concordances. More frequent or more complex verbs (e.g. say and throw) get a larger sample. The other samples (“trial” and “adjudicated”) contain 50 concordances each. The multiple annotation is only available for the last annotation round, i.e. 50 concordances.

Patterns Compilation Procedure

There are three annotators and one lexicographer. The lexicographer is in charge of revising the entry and of keeping all annotation samples in line with the current pattern definitions. This includes the analysis of interannotator disagreements after each annotation round and the adjudication of the last annotation. The lexicographer also takes part in the multiple annotation.
The annotators receive the entry along with the 250-concordance reference sample. They annotate another 50-concordance set, drawing on the entry, the reference sample and the annotation manual.
Interannotator agreement (IAA) is measured, confusion matrices are computed for each annotator pair and disagreements are manually analyzed.
When the interannotator confusion suggests that a revision of the entry is desirable, the lexicographer revises the entry, the 250-concordance (“original”) sample and the 50-concordance sample. The annotators receive them along with a new 50-concordance sample for annotation. (In other words, the annotated sample was not approved as the final multiple annotation, so it has become a “trial”.) This procedure could be repeated as long as the agreement is low and the entry is identified as the problem, but in practice we have needed three rounds at most.
When the IAA is satisfactory and the entry does not require any further modifications, the lexicographer makes the final disagreement analysis. A record of the analysis is kept (column Adjudicated Multiple Annotation/samples in the table). This record contains the multiple values for each concordance. The lexicographer marks evident errors and selects one “best” value.
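The measurement step above can be illustrated with a short sketch. It computes a confusion matrix and Cohen's kappa for every annotator pair. Note that this page does not say which agreement coefficient is behind the IAA column, so Cohen's kappa is an assumption chosen for illustration; the annotator names and pattern numbers are hypothetical as well.

```python
from collections import Counter
from itertools import combinations


def confusion_matrix(labels_a, labels_b):
    """Count how often annotator A chose pattern x where annotator B chose y."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same concordances")
    return Counter(zip(labels_a, labels_b))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(labels_a)
    matrix = confusion_matrix(labels_a, labels_b)
    # Observed agreement: share of concordances with identical labels.
    observed = sum(c for (x, y), c in matrix.items() if x == y) / n
    # Expected agreement if both annotators labelled independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pairwise_report(annotations):
    """Return {(name_a, name_b): (kappa, confusion matrix)} for every pair."""
    return {
        (a, b): (cohen_kappa(annotations[a], annotations[b]),
                 confusion_matrix(annotations[a], annotations[b]))
        for a, b in combinations(sorted(annotations), 2)
    }


# Hypothetical pattern numbers assigned to six concordances of one verb.
toy = {
    "A1": ["1", "1", "2", "3", "2", "1"],
    "A2": ["1", "1", "2", "3", "3", "1"],
    "A3": ["1", "2", "2", "3", "2", "1"],
}
for pair, (kappa, matrix) in pairwise_report(toy).items():
    print(pair, round(kappa, 3), dict(matrix))
```

Off-diagonal cells of a pair's matrix, such as ("2", "3"), show which two patterns the annotators confused, which is the kind of evidence the manual disagreement analysis starts from.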

By this procedure, we make sure that the 250-concordance “original” sample, as well as all the subsequent 50-concordance samples (“trial”), is in line with the last entry revision, and we merge them into an emerging gold standard sample for machine learning (in the table, they are kept separate). Consequently, we get at least 300 consistently annotated concordances for each verb. Entries of more complex verbs are based on a larger reference sample. In addition, we gain a multiple-value annotation cleared of evident annotator errors, such as typos or confusing a transitive pattern with an intransitive one.
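The arithmetic behind the "at least 300" figure can be written out as a small sketch. It assumes that every trial sample and the adjudicated sample hold exactly 50 concordances, as stated under Usual sample size; the function name gold_standard_size is only illustrative.

```python
def gold_standard_size(original_size, trial_rounds, round_size=50):
    """Concordances obtained by merging the original, trial and adjudicated samples.

    Assumes each trial sample and the single adjudicated sample contain
    round_size concordances.
    """
    if original_size < 0 or trial_rounds < 0 or round_size < 0:
        raise ValueError("sizes and round counts must be non-negative")
    # One adjudicated sample plus one sample per trial round.
    return original_size + (trial_rounds + 1) * round_size


# Values taken from the overview table.
print(gold_standard_size(250, 0))   # arrive: orig-250, no trial -> 300
print(gold_standard_size(250, 2))   # breathe: orig-250, trial1 trial2 -> 400
print(gold_standard_size(1000, 0))  # throw: orig-1000, no trial -> 1050
```

A verb with the usual 250-concordance reference sample and no trial rounds gives the minimum of 300; each trial round adds another 50.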

Annotation Scheme Description

Pattern Definition Form - [pdf]
- Pattern Definition XML Format - [dtd]
Annotation Manual - [pdf]

Infrastructure

The infrastructure is provided by the CPA project of the Natural Language Processing Centre at Masaryk University in Brno, under the supervision of Pavel Rychlý, Adam Rambousek, and Vít Baisa.
