Background data for: Advancing our understanding of dispersion measures in corpus research

Source: DataONE. Updated: 2025-07-16. Indexed: 2025-08-02.

Download link:

https://search.dataone.org/view/sha256:3ca7cbd3baafb2267784bbc8922712f99666fe43ce5ffe1ac3c84448df22d996

Resource description:

Dataset description: This dataset contains background data and supplementary material for Sönning (forthcoming), a study that looks at the behavior of dispersion measures when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, it includes tabular files listing the 730 research articles that were examined as well as annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. As for the corpus data that were used to train the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower-case) that occur in the Brown Corpus. Further, R scripts are included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement.

Abstract (related publication): This paper offers a survey of recent corpus-based work, which shows that dispersion is typically measured across the text files in a corpus. Systematic insights into the behavior of measures in such distributional settings are currently lacking, however. After a thorough discussion of six prominent indices, we investigate their behavior on relevant frequency distributions, which are designed to mimic actual corpus data. Our evaluation considers different distributional settings, i.e. various combinations of frequency and dispersion values. The primary focus is on the response of measures to relatively high and low sub-frequencies, i.e. texts in which the item or structure of interest is over- or underrepresented (if not absent). We develop a simple method for constructing sensitivity profiles, which allow us to draw instructive comparisons among measures. We observe that these profiles vary considerably across distributional settings. While D and DP appear to show the most balanced response contours, our findings suggest that much work remains to be done to understand the performance of measures on items with normalized frequencies below 100 per million words.
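
To make the terms in the description concrete, the following is a minimal sketch of how text-level sub-frequencies, a normalized frequency per million words, and two of the measures named in the abstract (D and DP) might be computed. It rests on several assumptions: the formulas are the commonly cited textbook definitions of Juilland's D (with sub-frequencies normalized by text length) and Gries's DP (unreversed, so 0 means a perfectly even distribution), and the dataset's R scripts may use different formulations or scaling. The toy texts, the whitespace tokenizer, and all function names are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def term_document_counts(texts):
    """Count lower-cased whitespace tokens per text (toy tokenizer, an assumption)."""
    counts = [Counter(text.lower().split()) for text in texts]
    sizes = [sum(c.values()) for c in counts]  # number of tokens in each text
    return counts, sizes


def sub_frequencies(counts, item):
    """Frequency of `item` in each text (one row of a term-document matrix)."""
    return [c[item] for c in counts]


def per_million(freqs, sizes):
    """Normalized corpus frequency per million words."""
    return 1_000_000 * sum(freqs) / sum(sizes)


def gries_dp(freqs, sizes):
    """DP: half the summed absolute difference between observed and expected shares.

    Unreversed scaling is assumed here: 0 = perfectly even distribution.
    """
    total, corpus = sum(freqs), sum(sizes)
    if total == 0:
        raise ValueError("item does not occur in the corpus")
    return 0.5 * sum(abs(v / total - s / corpus) for v, s in zip(freqs, sizes))


def juilland_d(freqs, sizes):
    """D = 1 - CV / sqrt(n - 1), with CV computed on length-normalized rates."""
    rates = [v / s for v, s in zip(freqs, sizes)]
    n = len(rates)
    mean = sum(rates) / n
    if mean == 0:
        raise ValueError("item does not occur in the corpus")
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / n)  # population SD
    return 1 - (sd / mean) / math.sqrt(n - 1)


# Hypothetical toy corpus of four equally long texts (not Brown Corpus data).
texts = ["The cat sat", "The dog ran", "the bird flew", "a cat slept"]
counts, sizes = term_document_counts(texts)
the_freqs = sub_frequencies(counts, "the")  # occurs in three of the four texts

dp_the = gries_dp(the_freqs, sizes)
d_the = juilland_d(the_freqs, sizes)
pmw_the = per_million(the_freqs, sizes)
```

In this toy case "the" is missing from one of four equally long texts, so both measures report a moderately uneven distribution. An item concentrated in a single text would push D toward 0 and unreversed DP toward its maximum.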

Created:

2025-07-17