
Data and code from: Statistical structure and the evolution of languages

NIAID Data Ecosystem, indexed 2026-05-10
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.pg4f4qs44
Description:
Human cultural development is marked by the emergence of new words and ideas, reflecting societal changes. But how does this evolution proceed? We use modern methods in natural language processing (namely, word embeddings) to measure statistical traces of cultural development, providing a testing ground to compare different models as to how this process works. We show that real embeddings of English and 21 other languages exhibit a series of previously unrecognized regularities, specifically (a) frequency assortativity, where entities of high popularity cluster near other high-popularity entities, (b) characteristic clustering velocity profiles due to aggregation into hierarchical structures, (c) persistent temporal dynamics, where newly-created entities appear disproportionately near other recent entries, and (d) Taylor’s law, implying that over time and across empirical semantic space the variance in new entity counts scales as a power of the mean, which helps systematize and quantify large historical fluctuations of neologisms. To explain these facts, we propose a class of generative models (specifically, directed preferential placement) that construct synthetic embeddings exhibiting similar regularities. We show that analogous regularities also occur in other data sets, suggesting that such generating models may shed light on new aspects of language and cultural evolution.
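Taylor's law, as stated in the description, says that the variance of new-entity counts scales as a power of the mean, i.e. variance ≈ a · mean^b. The following is a minimal sketch of how such an exponent could be estimated with a log-log least-squares fit. It is not the authors' code: the function name `taylor_exponent`, the layout of the count matrix (rows as regions of semantic space, columns as time windows), and the synthetic Poisson data are all assumptions made for illustration.

```python
import numpy as np


def taylor_exponent(counts):
    """Fit variance = a * mean**b across rows of a count matrix.

    Assumed layout (hypothetical, not taken from the dataset):
    each row is one region of semantic space, each column one time window.
    Returns (b, a).
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # unbiased sample variance per row
    keep = (mean > 0) & (var > 0)     # logarithms need positive values
    slope, intercept = np.polyfit(np.log(mean[keep]), np.log(var[keep]), 1)
    return slope, np.exp(intercept)


# Synthetic check: Poisson counts have variance equal to the mean,
# so the fitted exponent should come out close to 1.
rng = np.random.default_rng(0)
rates = np.logspace(0, 3, 30)
poisson_counts = rng.poisson(rates[:, None], size=(30, 200))
b, a = taylor_exponent(poisson_counts)
print(f"estimated exponent b = {b:.2f}")
```

An exponent near 1 is what independent, Poisson-like arrivals would give; exponents above 1 indicate burstier fluctuations than that baseline. Real data would also call for uncertainty estimates on the fit, which this sketch omits.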
Created:
2026-01-21