
Data from: Robust clustering of languages across Wikipedia growth

Source: DataONE | Updated: 2017-09-19 | Indexed: 2024-06-26
Download link:
https://search.dataone.org/view/null
Description:
Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over 5 million articles, comparatively little is known about the behaviour and growth of the remaining 283 smaller Wikipedias, the smallest of which, Afar, has only one article. Here, we use a subset of these data, consisting of 14 962 different articles, each of which exists in 26 different languages, from Arabic to Ukrainian. We study the growth of Wikipedias in these languages over a time span of 15 years. We show that, while an average article follows a random path from one language to another, there exist six well-defined clusters of Wikipedias that share common growth patterns. The make-up of these clusters is remarkably robust against the method used for their determination, as we verify via four different clustering methods. Interestingly, the identified Wikipedia clusters have little correlation with language families and groups. Rather, the growth of Wikipedia across different languages is governed by different factors, ranging from similarities in culture to information literacy.
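
The description states that the cluster make-up stays stable across four clustering methods. One common way to quantify that kind of agreement is the Rand index, the fraction of item pairs that two clusterings treat the same way. The sketch below is only an illustration under stated assumptions: the description does not say which agreement measure the authors used, and the function name `rand_index` and the label lists are hypothetical, not taken from the dataset.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster assignments for six language editions (not real data).
method_1 = [0, 0, 1, 1, 2, 2]
method_2 = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
method_3 = [0, 0, 0, 1, 2, 2]  # one edition assigned to another cluster

print(rand_index(method_1, method_2))
print(rand_index(method_1, method_3))
```

The index depends only on which items are grouped together, so renaming the cluster labels does not change it. A value close to 1 for every pair of methods would be consistent with the robustness the description reports.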

Created:
2017-09-19