VPS-30-En (Verb Pattern Sample, 30 English verbs) is a newly developed lexical resource.
It contains descriptions of the following 30 English verbs:
If you would like to get all the data as a single package, please write an e-mail to <krejcova(at)ufal.mff.cuni.cz>.
All pattern definitions (lexicon entries) as well as all annotations we have produced are listed and links to the actual files in several formats are provided directly in the table.
The table is divided into three sections:
The first section, Verb Entries, lists the verbs in alphabetical order, the number of their patterns in the Validation Database, and their pattern definitions. The entries are provided in both html format (preview only) and xml format.
The following two sections contain annotated corpus concordances. Section Lexicographer-revised Annotation contains
The column named adjudicated contains the adjudicated results of the last multiple annotation round. “Adjudicated” means that the lexicographer considered all values suggested by the annotators and selected “the best one”, which is the only value kept in the file. The files stored in this column thus constitute a single-value annotation based on the feedback from a multiple annotation. Adjudication was only performed for a round in which the interannotator agreement was reasonably good and the manual disagreement analysis did not reveal any obvious need to correct the verb entry.
Whenever the outcome of the multiple annotation was not good enough and the entry needed a revision, the multiple annotation was discarded. The lexicographer revised the entry and updated the reference sample (“original”) to match the revised pattern definitions. The same was done to the sample that had been subject to the multiple annotation round that triggered the entry revision. Each annotation round that resulted in an entry revision produced one such sample. These samples are called trial. Usually, there are one or two per verb.
The last section, Adjudicated Multiple Annotation, contains the final annotation round with its multiple values, that is, the round that was declared satisfactory and after which no entry revisions followed. The lexicographer checked all the annotations and deleted evident errors. The record of the disagreement analysis is stored in the adjudication tables located in the samples column. It is available as html (just a preview without concordance ID numbers), xls (the original file with errors marked in red) and csv (without colors, but the erroneous values are not contained in the list of acceptable values). All files (except the html preview) also contain the BNC-native sentence IDs.
Section Adjudicated Multiple Annotation contains complementary information about the number of annotators (the annotators column), the interannotator agreement (the IAA column), as well as the confusion matrices (column confusion matrices) for the last annotation round.
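As an illustration of how an agreement figure and a confusion matrix relate to the raw annotations, here is a minimal sketch for two annotators. It is not the procedure used to compute the values in the table: the choice of Cohen's kappa is an assumption (the agreement coefficient is not specified here), and the function names and the pattern labels in the example are hypothetical.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b):
    """Count how often each (annotator A label, annotator B label) pair occurs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same concordances")
    return Counter(zip(labels_a, labels_b))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(labels_a)
    matrix = confusion_matrix(labels_a, labels_b)
    # Observed agreement: share of concordances with identical labels.
    observed = sum(c for (a, b), c in matrix.items() if a == b) / n
    # Expected agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pattern numbers assigned to six concordances of one verb.
annotator_a = ["1", "1", "2", "3", "2", "1"]
annotator_b = ["1", "2", "2", "3", "2", "1"]

matrix = confusion_matrix(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy example the off-diagonal cell `("1", "2")` of the matrix records the single disagreement, which is the kind of information a manual disagreement analysis would start from.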
The sample that is released along with the entry (“original”) is usually 250 concordances. More frequent or more complex verbs (e.g. say and throw) get a larger sample. The other samples (“trial” and “adjudicated”) contain 50 concordances each. The multiple annotation is only available for the last annotation round, i.e. 50 concordances.
By this procedure, we make sure that the 250-concordance “original” sample, as well as all the subsequent 50-concordance samples (“trial”), are in line with the last entry revision, and we merge them into an emerging gold standard sample for machine learning (in the table, they are kept separate). Consequently, we get at least 300 consistently annotated concordances for each verb. Entries of more complex verbs are based on a larger reference sample. In addition, we gain a multiple-value annotation cleared of evident annotator errors, such as typos or mistaking a transitive pattern for an intransitive one.
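The merge of the separately stored samples can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a concordance ID paired with a pattern label), the function name `merge_samples`, and the IDs in the example are all hypothetical and do not reflect the actual file formats.

```python
def merge_samples(samples):
    """Merge named samples into one dict keyed by concordance ID.

    `samples` maps a sample name (e.g. "original") to a list of
    (concordance_id, pattern) pairs. A concordance that occurs in more
    than one sample must carry the same pattern everywhere.
    """
    gold = {}
    for sample_name, records in samples.items():
        for concordance_id, pattern in records:
            if concordance_id in gold and gold[concordance_id][0] != pattern:
                raise ValueError(
                    f"conflicting patterns for concordance {concordance_id}"
                )
            # Keep the first sample a concordance was seen in.
            gold.setdefault(concordance_id, (pattern, sample_name))
    return gold


# Hypothetical data: a 250-concordance reference sample plus one
# 50-concordance sample, giving 300 annotated concordances in total.
samples = {
    "original": [(f"orig-{i}", "1") for i in range(250)],
    "adjudicated": [(f"adj-{i}", "2") for i in range(50)],
}
gold = merge_samples(samples)
```

Raising an error on conflicting labels is a design choice of this sketch: it makes any sample that is out of line with the others visible instead of silently overwriting a value.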
The infrastructure is provided by the CPA project of the Natural Language Processing Centre at Masaryk University in Brno, under the supervision of Pavel Rychlý, Adam Rambousek, and Vít Baisa.