
Replication Data for: Studying lexical dynamics and language change via generalized entropies – the problem of sample size

Source: DataONE | Updated: 2019-04-09 | Indexed: 2024-06-08
Download link:
https://search.dataone.org/view/sha256:8e132c331e8a982182b1b1d502e13292536fbabd4a5567fa5fc6445738d0bd9b
Description:
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences. For the analysis of the statistical properties of natural languages, this is especially interesting since textual data are characterized by Zipf’s law, i.e. there are very few word types that occur very often (e.g. function words expressing grammatical relationships) and very many word types with a very low frequency (e.g. content words carrying most of the meaning of a sentence). Varying α makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine “Der Spiegel” (consisting of approximately 365k articles and 237M words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e. text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
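As a rough illustration of how the order α shifts weight across the word frequency spectrum, the sketch below computes a generalized entropy and a Jensen-Shannon-type divergence between two word-count distributions. The description above does not state the formulas, so the sketch assumes the common form H_α(p) = (1 − Σ p_i^α) / (α − 1), which reduces to Shannon entropy as α → 1. The function names and the toy texts are hypothetical and are not part of the dataset.

```python
from collections import Counter
import math


def generalized_entropy(weights, alpha):
    """Generalized entropy of order alpha (assumed Tsallis-type form).

    `weights` maps word types to counts or probabilities; they are
    normalized to sum to 1 before the entropy is computed.
    """
    total = sum(weights.values())
    probs = [w / total for w in weights.values() if w > 0]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1: Shannon entropy (natural logarithm).
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)


def generalized_jsd(counts_p, counts_q, alpha):
    """Jensen-Shannon-type divergence built on the entropy above.

    D_alpha(p, q) = H_alpha((p + q) / 2) - H_alpha(p) / 2 - H_alpha(q) / 2
    """
    n_p = sum(counts_p.values())
    n_q = sum(counts_q.values())
    types = set(counts_p) | set(counts_q)
    # Equal-weight mixture of the two relative-frequency distributions.
    mixture = {
        w: 0.5 * counts_p.get(w, 0) / n_p + 0.5 * counts_q.get(w, 0) / n_q
        for w in types
    }
    return (
        generalized_entropy(mixture, alpha)
        - 0.5 * generalized_entropy(counts_p, alpha)
        - 0.5 * generalized_entropy(counts_q, alpha)
    )


if __name__ == "__main__":
    # Hypothetical toy texts; real analyses would use tokenized articles.
    text_a = Counter("the cat sat on the mat and the dog sat too".split())
    text_b = Counter("the stock market fell and the bond market rose".split())
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, generalized_jsd(text_a, text_b, alpha))
```

Under this assumed form, larger α puts more weight on high-frequency types such as function words, while smaller α emphasizes rare types. Because the estimated relative frequencies depend on how many tokens are observed, comparing texts of different lengths with such a measure is subject to the sample-size dependence the description discusses.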
Created:
2023-11-22