
Background data for: Advancing our understanding of dispersion measures in corpus research

Source: doi.org. Updated: 2024-11-26. Indexed: 2025-03-23.
Download link:
https://doi.org/10.18710/FVHTFM
Resource description:
Dataset description:
This dataset contains background data and supplementary material for Sönning (forthcoming), a study that looks at the behavior of dispersion measures when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, it includes tabular files listing the 730 research articles that were examined as well as annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. As for the corpus data that were used to train the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower-case) that occur in the Brown Corpus. Further, R scripts are included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement.

Abstract of the related publication:
This paper offers a survey of recent corpus-based work, which shows that dispersion is typically measured across the text files in a corpus. Systematic insights into the behavior of measures in such distributional settings are currently lacking, however. After a thorough discussion of six prominent indices, we investigate their behavior on relevant frequency distributions, which are designed to mimic actual corpus data. Our evaluation considers different distributional settings, i.e. various combinations of frequency and dispersion values. The primary focus is on the response of measures to relatively high and low sub-frequencies, i.e. texts in which the item or structure of interest is over- or underrepresented (if not absent). We develop a simple method for constructing sensitivity profiles, which allow us to draw instructive comparisons among measures. We observe that these profiles vary considerably across distributional settings. While D and DP appear to show the most balanced response contours, our findings suggest that much work remains to be done to understand the performance of measures on items with normalized frequencies below 100 per million words.
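To make the notions of a term-document matrix and text-level dispersion more concrete, the following is a minimal Python sketch, not the author's R workflow. It builds a toy term-document matrix from lower-cased word forms and computes two of the measures named in the abstract, D and DP, in their commonly cited textbook forms (Juilland's D on size-normalized rates, Gries's DP as half the summed absolute deviations). Everything here is an assumption made for illustration: the whitespace tokenization, the function names, the toy texts, and the exact formula variants and scaling, which may differ from those used in the study (some authors, for instance, report 1 − DP so that higher values mean more even dispersion).

```python
from collections import Counter
from math import sqrt


def term_document_matrix(texts):
    """Map each lower-cased word form to its list of per-text counts.

    Tokenization by whitespace is a simplifying assumption.
    """
    counts = [Counter(tok.lower() for tok in text.split()) for text in texts]
    vocab = sorted(set().union(*counts))
    return {word: [c[word] for c in counts] for word in vocab}


def text_sizes(texts):
    """Number of whitespace-separated tokens in each text."""
    return [len(text.split()) for text in texts]


def dp(sub_freqs, sizes):
    """Gries's DP: 0 means perfectly even, values near 1 mean concentrated."""
    total = sum(sub_freqs)
    corpus_size = sum(sizes)
    return 0.5 * sum(
        abs(v / total - s / corpus_size) for v, s in zip(sub_freqs, sizes)
    )


def juilland_d(sub_freqs, sizes):
    """Juilland's D on size-normalized rates: 1 means even, 0 concentrated."""
    rates = [v / s for v, s in zip(sub_freqs, sizes)]
    k = len(rates)
    mean = sum(rates) / k
    # Population standard deviation of the per-text rates.
    sd = sqrt(sum((r - mean) ** 2 for r in rates) / k)
    return 1 - (sd / mean) / sqrt(k - 1)


# Hypothetical toy corpus of three equally sized texts.
texts = ["The cat sat", "the dog ran", "A bird sang"]
tdm = term_document_matrix(texts)
sizes = text_sizes(texts)

for word in ("the", "cat"):
    row = tdm[word]
    print(word, row, round(dp(row, sizes), 3), round(juilland_d(row, sizes), 3))
```

In this toy corpus, "the" occurs in two of the three texts and "cat" in only one, so "cat" comes out as less evenly dispersed on both measures. Note that the two measures run in opposite directions in these formulations: a higher DP indicates more uneven dispersion, whereas a higher D indicates more even dispersion.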

Provider:
doi.org