Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data

Name: Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data
Creator: DataverseNO
Published: 2023-10-24 00:00:00
License: No description available

Source: doi.org | Updated: 2023-10-24 | Indexed: 2025-01-08

Download link:

https://doi.org/10.18710/5KCE4U

Resource description:

Dataset description

This dataset, which is adapted from Jenset and McGillivray (2017), contains tabular files documenting the alternating usage of -(e)th and -(e)s to mark third-person verb inflection in Early Modern English. The data provided by Jenset and McGillivray (2017) are drawn from the PPCEME corpus (Kroch et al. 2004) and cover the period from 1500 to 1700. In total, 13,757 third-person singular tokens (excluding the verb BE) were annotated by these authors for a range of variables. For the purposes of the present methodological study, this dataset was reduced to a subset of 11,645 tokens, and the coding of variables was in some parts revised, completed, or modified. The dataset includes information about the Author and Verb Lemma, as well as a number of predictor variables, including Genre, Year, Frequency (of the verb lemma in the third-person singular), Phonological Context (stem-final sound), and the Gender of the author.

Abstract for related publication

Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: Year, Gender, Genre, Frequency, and Phonological Context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
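The two selection strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes one possible reading of "author- and verb-specific token count", namely the number of tokens sharing the same author and verb lemma, and it draws without replacement, so inclusion probabilities are only approximately inversely proportional to that count. The field names `author`, `verb`, and `id`, the function names, and the toy data are hypothetical and do not reflect the actual column names in the dataset files.

```python
import heapq
import random
from collections import Counter


def simple_downsample(tokens, k, seed=0):
    """Simple down-sampling: every token has the same selection probability."""
    rng = random.Random(seed)
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, seed=0):
    """Structured down-sampling (assumed reading): weight each token by
    1 / (number of tokens with the same author and verb lemma), then draw
    k tokens without replacement using an exponential race."""
    rng = random.Random(seed)
    cell_counts = Counter((t["author"], t["verb"]) for t in tokens)
    keyed = []
    for i, t in enumerate(tokens):
        count = cell_counts[(t["author"], t["verb"])]
        # Exp(1) draw divided by the weight 1/count; smaller keys win.
        key = rng.expovariate(1.0) * count
        # The index i breaks ties so the dicts are never compared.
        keyed.append((key, i, t))
    return [t for _, _, t in heapq.nsmallest(k, keyed)]


# Toy data (hypothetical): one prolific author-verb pair and ten rare ones.
tokens = [{"id": i, "author": "A", "verb": "say"} for i in range(90)]
tokens += [{"id": 90 + j, "author": "B", "verb": f"verb{j}"} for j in range(10)]

simple = simple_downsample(tokens, 10, seed=1)
structured = structured_downsample(tokens, 10, seed=1)
print(sum(t["author"] == "B" for t in simple), "tokens by B (simple)")
print(sum(t["author"] == "B" for t in structured), "tokens by B (structured)")
```

In this toy setup, the structured scheme tends to pick far more tokens from the rare author-verb combinations than the simple scheme does, which is the intended effect of weighting against large clusters.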

Provider:

DataverseNO