
Background data for: Advancing our understanding of dispersion measures in corpus research

DataverseNO | Updated 2025-07-17 | Indexed 2026-04-13
Download link:
https://dataverse.no/citation?persistentId=doi:10.18710/FVHTFM
Resource description:
<p><b>Dataset description</b></p> <p>This dataset contains background data and supplementary material for Sönning (forthcoming), a study that looks at the behavior of dispersion measures when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, it includes tabular files listing the 730 research articles that were examined as well as annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. As for the corpus data that were used to train the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower-case) that occur in the Brown Corpus. Further, R scripts are included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement.</p>
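To make the term-document matrix mentioned in the description concrete, here is a minimal Python sketch. It is not the R code shipped with the dataset. The toy documents and the function names `term_document_matrix` and `gries_dp` are hypothetical and chosen only for illustration. Gries's deviation of proportions (DP) serves as one example of a dispersion measure; the description does not state which measures the study evaluates.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document matrix from tokenized documents.

    Returns (vocab, matrix), where matrix[i][j] is the number of times
    vocab[i] occurs in docs[j]. Tokens are lower-cased first, mirroring
    the lower-case conversion mentioned in the dataset description.
    """
    counts = [Counter(tok.lower() for tok in doc) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for c in counts] for w in vocab]
    return vocab, matrix


def gries_dp(row, doc_sizes):
    """Deviation of proportions (DP) for one word form.

    row       : text-level frequencies of the word (one matrix row)
    doc_sizes : token counts of the texts, in the same order
    DP is 0 when occurrences are spread in proportion to text size
    and approaches 1 when they are concentrated in a small part.
    """
    total = sum(row)
    n = sum(doc_sizes)
    return 0.5 * sum(abs(f / total - s / n) for f, s in zip(row, doc_sizes))


# Hypothetical toy "corpus" of three tokenized texts (an assumption,
# standing in for the Brown Corpus texts).
toy_docs = [
    ["The", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["A", "cat", "ran"],
]

vocab, tdm = term_document_matrix(toy_docs)
doc_sizes = [len(d) for d in toy_docs]

for word, row in zip(vocab, tdm):
    print(word, row, round(gries_dp(row, doc_sizes), 3))
```

Each row of the matrix holds the text-level frequencies of one word form, which is the kind of input a dispersion measure takes. Swapping the toy lists for tokenized Brown Corpus texts (for example, read through NLTK) should give a matrix of the same shape as the one described above, although the exact tokenization used in the dataset's R scripts may differ.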
Provider:
University of Bamberg
Created:
2023-06-28