Background data for: Advancing our understanding of dispersion measures in corpus research

Name: Background data for: Advancing our understanding of dispersion measures in corpus research
Creator: University of Bamberg
Published: 2025-07-17 00:00:00
License: No description available

DataverseNO | Updated: 2025-07-17 | Indexed: 2026-04-13

Download link:

https://dataverse.no/citation?persistentId=doi:10.18710/FVHTFM

Resource description:

Dataset description: This dataset contains background data and supplementary material for Sönning (forthcoming), a study of how dispersion measures behave when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, the dataset includes tabular files listing the 730 research articles that were examined, along with annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. For the corpus data used to train the parameters of the statistical model underlying the simulation study reported in that paper, it provides a term-document matrix for the 49,604 unique word forms (after conversion to lower case) that occur in the Brown Corpus. R scripts are also included; they document in detail how the Brown Corpus XML files, available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement.
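
To make the term-document matrix and the idea of text-level dispersion concrete, here is a minimal Python sketch. It is not the dataset's own pipeline, which is documented in the included R scripts and works from the Brown Corpus XML files. The three toy texts, the whitespace tokenization, and the function names are assumptions chosen for illustration. As one example of a dispersion measure, the sketch computes the deviation of proportions (DP), which ranges from 0 (occurrences spread in proportion to text size) towards 1 (occurrences concentrated in few texts).

```python
from collections import Counter

def term_document_matrix(texts):
    """Build a sparse term-document matrix from raw texts.

    Returns (vocab, rows): the sorted list of unique lower-cased word
    forms, and one Counter of word-form frequencies per text.
    Whitespace tokenization is an assumption made for this sketch.
    """
    rows = [Counter(text.lower().split()) for text in texts]
    vocab = sorted(set().union(*rows))
    return vocab, rows

def deviation_of_proportions(word, rows):
    """Compute DP for one word form across texts.

    DP = 0.5 * sum(|v_i - s_i|), where v_i is the share of the word's
    occurrences found in text i and s_i is text i's share of the corpus.
    """
    sizes = [sum(row.values()) for row in rows]
    corpus_size = sum(sizes)
    word_total = sum(row[word] for row in rows)
    if word_total == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return 0.5 * sum(
        abs(row[word] / word_total - size / corpus_size)
        for row, size in zip(rows, sizes)
    )

# Hypothetical toy corpus: three "texts" of three tokens each.
texts = ["The cat sat", "the dog sat", "A bird flew"]
vocab, rows = term_document_matrix(texts)

dp_the = deviation_of_proportions("the", rows)    # occurs in two of three texts
dp_bird = deviation_of_proportions("bird", rows)  # occurs in one text only
```

In this toy corpus, "bird" receives a higher DP than "the" because all of its occurrences fall in a single text. Lower-casing before counting mirrors the conversion described above, so "The" and "the" are counted as one word form.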

Provider:

University of Bamberg

Created:

2023-06-28