Corpus of Global Language Use (CGLU)
Source: arXiv. Updated: 2020-04-02. Indexed: 2024-06-21.
Download link:
https://www.earthlings.io/corpus_download.html
Resource description:
The Corpus of Global Language Use (CGLU) is a publicly available corpus created by the Department of Linguistics at the University of Canterbury. It contains approximately 423 billion words extracted from 14.7 billion web pages, covering 148 languages and 158 countries, with at least 1 million words for each language and each country. The corpus aims to represent regional language varieties through a consistent data collection methodology and to provide a data-driven resource for understanding where languages are used. To assess how well digital language data represents actual populations, the corpus is systematically compared against demographic benchmark data and triangulated with an alternative Twitter-based dataset. Applications include language mapping and the evaluation of language identification models, with the goal of addressing questions of linguistic diversity and geographic distribution.
Provided by:
Department of Linguistics, University of Canterbury
Created:
2020-04-02



