
Evaluation of text-level measures of lexical dispersion

Source: osf.io | Updated: 2023-06-30 | Indexed: 2025-03-22
Download link: https://osf.io/rpfb8
Resource description:
This OSF project is associated with a study that evaluates the robustness of various dispersion measures. It builds on recent methodological work, which has argued that dispersion should be measured across linguistically meaningful units such as the text files constituting a corpus. This shift to texts as the unit of analysis raises new methodological issues, however. This paper sheds light on the robustness of different measures, i.e. whether they are (overly) sensitive to data situations that can arise when texts differ (considerably) in length. We identify weak spots in existing measures, and then propose modifications to DP- and DA-related indexes to effect more resistant estimators. Along with other measures, these are then evaluated using data drawn from the BNC. We observe that our modified variants perform at least as well as their original versions. We also find that Carroll’s D2 shows the same weakness as Juilland’s D, a noticeable sensitivity to the number of units that enter the analysis.
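
To make the quantities named in the description concrete, here is a minimal sketch of two of the measures it mentions, computed over texts of given lengths. It assumes the common textbook definitions of Gries's DP and Juilland's D (with D computed on per-text relative frequencies); it does not reproduce the modified variants proposed in the study, and the function names and toy data are hypothetical.

```python
from math import sqrt

def dp(counts, sizes):
    """Gries's DP (textbook form, assumed): half the summed absolute
    difference between each text's share of the word's occurrences and
    its share of the corpus. 0 = perfectly even; values near 1 = clumped."""
    total = sum(counts)
    corpus = sum(sizes)
    return 0.5 * sum(abs(c / total - s / corpus) for c, s in zip(counts, sizes))

def juilland_d(counts, sizes):
    """Juilland's D (textbook form, assumed), computed on per-text rates:
    1 minus the coefficient of variation divided by sqrt(n - 1).
    1 = perfectly even; 0 = all occurrences in a single text."""
    rates = [c / s for c, s in zip(counts, sizes)]
    n = len(rates)
    mean = sum(rates) / n
    sd = sqrt(sum((r - mean) ** 2 for r in rates) / n)  # population SD
    return 1 - (sd / mean) / sqrt(n - 1)

# Hypothetical toy corpus: four texts of equal length.
sizes = [1000, 1000, 1000, 1000]
even = [5, 5, 5, 5]        # word spread evenly across texts
clumped = [20, 0, 0, 0]    # word confined to one text

print(dp(even, sizes), juilland_d(even, sizes))
print(dp(clumped, sizes), juilland_d(clumped, sizes))

# Texts of very different length: DP stays at 0 when occurrences are
# proportional to text length.
print(dp([1, 9], [100, 900]))
```

Note that the two scales run in opposite directions: DP grows as a word becomes more clumped, whereas D shrinks. The `sqrt(n - 1)` term in D is where the number of texts enters the formula.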

Provider: Center For Open Science