statmt/cc100

Hugging Face, updated 2024-03-05, indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/statmt/cc100

Overview:

CC-100 is a multilingual text corpus containing monolingual data for more than 100 languages, along with data for several romanized languages. It was built by processing the January to December 2018 Common Crawl snapshots and is an attempt to recreate the dataset used to train XLM-R. Its primary use is pre-training language models and word representations. Each data point has an ID and a text field holding one paragraph; documents are separated by a data point that contains a single newline character.

Provider:

statmt

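Summary of Original Information

Dataset Card for CC-100

Dataset Description

Dataset Summary

This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for more than 100 languages and also includes data for romanized languages (indicated by *_rom). It was constructed by processing the January to December 2018 Common Crawl snapshots, using the URLs and paragraph indices provided by the CC-Net repository.

Supported Tasks and Leaderboards

CC-100 is mainly intended for pre-training language models and word representations.

Languages

The languages in the dataset are: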

af: Afrikaans (305M)
am: Amharic (133M)
ar: Arabic (5.4G)
as: Assamese (7.6M)
az: Azerbaijani (1.3G)
be: Belarusian (692M)
bg: Bulgarian (9.3G)
bn: Bengali (860M)
bn_rom: Bengali Romanized (164M)
br: Breton (21M)
bs: Bosnian (18M)
ca: Catalan (2.4G)
cs: Czech (4.4G)
cy: Welsh (179M)
da: Danish (12G)
de: German (18G)
el: Greek (7.4G)
en: English (82G)
eo: Esperanto (250M)
es: Spanish (14G)
et: Estonian (1.7G)
eu: Basque (488M)
fa: Persian (20G)
ff: Fulah (3.1M)
fi: Finnish (15G)
fr: French (14G)
fy: Frisian (38M)
ga: Irish (108M)
gd: Scottish Gaelic (22M)
gl: Galician (708M)
gn: Guarani (1.5M)
gu: Gujarati (242M)
ha: Hausa (61M)
he: Hebrew (6.1G)
hi: Hindi (2.5G)
hi_rom: Hindi Romanized (129M)
hr: Croatian (5.7G)
ht: Haitian Creole (9.1M)
hu: Hungarian (15G)
hy: Armenian (776M)
id: Indonesian (36G)
ig: Igbo (6.6M)
is: Icelandic (779M)
it: Italian (7.8G)
ja: Japanese (15G)
jv: Javanese (37M)
ka: Georgian (1.1G)
kk: Kazakh (889M)
km: Khmer (153M)
kn: Kannada (360M)
ko: Korean (14G)
ku: Kurdish (90M)
ky: Kyrgyz (173M)
la: Latin (609M)
lg: Ganda (7.3M)
li: Limburgish (2.2M)
ln: Lingala (2.3M)
lo: Lao (63M)
lt: Lithuanian (3.4G)
lv: Latvian (2.1G)
mg: Malagasy (29M)
mk: Macedonian (706M)
ml: Malayalam (831M)
mn: Mongolian (397M)
mr: Marathi (334M)
ms: Malay (2.1G)
my: Burmese (46M)
my_zaw: Burmese (Zawgyi) (178M)
ne: Nepali (393M)
nl: Dutch (7.9G)
no: Norwegian (13G)
ns: Northern Sotho (1.8M)
om: Oromo (11M)
or: Oriya (56M)
pa: Punjabi (90M)
pl: Polish (12G)
ps: Pashto (107M)
pt: Portuguese (13G)
qu: Quechua (1.5M)
rm: Romansh (4.8M)
ro: Romanian (16G)
ru: Russian (46G)
sa: Sanskrit (44M)
sc: Sardinian (143K)
sd: Sindhi (67M)
si: Sinhala (452M)
sk: Slovak (6.1G)
sl: Slovenian (2.8G)
so: Somali (78M)
sq: Albanian (1.3G)
sr: Serbian (1.5G)
ss: Swati (86K)
su: Sundanese (15M)
sv: Swedish (21G)
sw: Swahili (332M)
ta: Tamil (1.3G)
ta_rom: Tamil Romanized (68M)
te: Telugu (536M)
te_rom: Telugu Romanized (79M)
th: Thai (8.7G)
tl: Tagalog (701M)
tn: Tswana (8.0M)
tr: Turkish (5.4G)
ug: Uyghur (46M)
uk: Ukrainian (14G)
ur: Urdu (884M)
ur_rom: Urdu Romanized (141M)
uz: Uzbek (155M)
vi: Vietnamese (28G)
wo: Wolof (3.6M)
xh: Xhosa (25M)
yi: Yiddish (51M)
yo: Yoruba (1.1M)
zh-Hans: Chinese (Simplified) (14G)
zh-Hant: Chinese (Traditional) (5.3G)
zu: Zulu (4.3M)
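
The sizes above use K, M, and G suffixes. As a minimal sketch for comparing configurations against a storage budget, the helper below converts such strings to approximate byte counts. The function name `parse_size` is hypothetical, and treating the suffixes as decimal multiples (1K = 1,000 bytes) is an assumption; the listing does not say whether the units are decimal or binary.

```python
# Hypothetical helper: convert size strings such as "305M" or "5.4G" to bytes.
# Assumption: suffixes are decimal multiples; the listing does not specify this.
UNITS = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_size(size: str) -> int:
    """Return the approximate number of bytes for a size string like '5.4G'."""
    size = size.strip()
    number, suffix = size[:-1], size[-1].upper()
    if suffix not in UNITS:
        raise ValueError(f"unrecognized size suffix in {size!r}")
    return round(float(number) * UNITS[suffix])


# A few entries copied from the list above.
sizes = {"am": "133M", "en": "82G", "sc": "143K", "sr": "1.5G"}

# Order the selected configurations from smallest to largest.
ordered = sorted(sizes, key=lambda code: parse_size(sizes[code]))
print(ordered)
```

Sorting by the parsed value rather than by the raw string avoids comparing "82G" and "143K" lexicographically.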

Dataset Structure

Data Instances

An example from the am configuration:

{'id': '0', 'text': 'ተለዋዋጭ የግድግዳ አንግል ሙቅ አንቀሳቅሷል ቲ-አሞሌ አጥቅሼ ...'}

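Each data point is a paragraph of text. Paragraphs are presented in their original (unshuffled) order. Documents are separated by a data point that consists of a single newline character.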
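
The separator convention described above can be used to regroup paragraphs into documents. The following is a minimal sketch, assuming each example is a dict with `id` and `text` keys and that a separator data point has a `text` value equal to a single newline character; the function name `group_documents` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def group_documents(examples: Iterable[Dict[str, str]]) -> Iterator[List[str]]:
    """Yield documents as lists of paragraphs, splitting on separator examples."""
    paragraphs: List[str] = []
    for example in examples:
        # Assumption: a separator is an example whose text is exactly one newline.
        if example["text"] == "\n":
            # Separator reached: emit the document collected so far, if any.
            if paragraphs:
                yield paragraphs
                paragraphs = []
        else:
            # Drop the trailing newline that paragraph texts may carry.
            paragraphs.append(example["text"].rstrip("\n"))
    # Emit the final document when the stream does not end with a separator.
    if paragraphs:
        yield paragraphs


# Made-up records for illustration only.
sample = [
    {"id": "0", "text": "First paragraph of document A.\n"},
    {"id": "1", "text": "Second paragraph of document A.\n"},
    {"id": "2", "text": "\n"},
    {"id": "3", "text": "Only paragraph of document B.\n"},
]
documents = list(group_documents(sample))
print(len(documents))
```

Because the function is a generator, it can be applied to a streamed configuration without holding the whole corpus in memory.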

Data Fields

The data fields are:

id: the id of the example
text: the content, as a string

Data Splits

Sizes of some configurations:

Name	Training set size
am	3124561
sr	35747957

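Dataset Creation

Source Data

The data comes from web pages in many languages.

Annotations

The dataset does not contain any additional annotations.

Personal and Sensitive Information

Because the dataset is derived from Common Crawl, it may contain personal and sensitive information. This must be taken into account when using CC-100 to train deep learning models, especially text generation models.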
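
As one illustration of taking this into account, a preprocessing step can mask obvious identifiers before training. The sketch below masks e-mail addresses with a deliberately simple regular expression. The pattern, the `[EMAIL]` placeholder, and the function name `mask_emails` are assumptions made for illustration and are not part of CC-100 or its tooling. It will miss many kinds of personal information and is not a complete anonymization method.

```python
import re

# Deliberately simple pattern for illustration; real e-mail syntax is broader.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace substrings that look like e-mail addresses with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


print(mask_emails("Contact jane.doe@example.org for details."))
```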

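Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

This dataset was prepared by the Statistical Machine Translation group at the University of Edinburgh using the CC-Net toolkit from Facebook Research.

Licensing Information

The Statistical Machine Translation group at the University of Edinburgh makes no claims of intellectual property on the work of preparing the corpus. When using this dataset, you must also comply with the Common Crawl terms of use.

Citation Information

If you find the resources in this corpus useful, please cite the following: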

@inproceedings{conneau-etal-2020-unsupervised,
    title = "Unsupervised Cross-lingual Representation Learning at Scale",
    author = "Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin",
    editor = "Jurafsky, Dan and Chai, Joyce and Schluter, Natalie and Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.747",
    doi = "10.18653/v1/2020.acl-main.747",
    pages = "8440--8451",
    abstract = "This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6{\%} average accuracy on XNLI, +13{\%} average F1 score on MLQA, and +2.4{\%} F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7{\%} in XNLI accuracy for Swahili and 11.4{\%} for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.",
}

@inproceedings{wenzek-etal-2020-ccnet,
    title = "{CCN}et: Extracting High Quality Monolingual Datasets from Web Crawl Data",
    author = "Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, Edouard",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.494",
    pages = "4003--4012",
    abstract = "Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @abhishekkrthakur for adding this dataset.

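Collected Summary

Dataset Introduction

Construction

CC-100 was built from Common Crawl: the snapshots from January to December 2018 were processed with the CC-Net toolkit, which was used for language identification and text extraction on the web pages, yielding a text corpus that covers more than 100 languages. The corpus is an attempt to recreate the dataset used to train XLM-R and provides a large body of text for pre-training multilingual language models.

Features

CC-100 is distinguished by its multilingual coverage, which spans languages from Africa to Asia and includes several romanized variants. The corpus is large: training-set sizes range from millions to hundreds of millions of examples across language configurations (for example, 3,124,561 for am and 35,747,957 for sr), which makes it suitable for large-scale language model training. The data also keeps paragraphs in their original order and preserves document boundaries, which helps transfer learning to downstream tasks.

Usage

To use CC-100, select the language configuration that fits the task. The dataset provides a training split, which can be used for language model pre-training objectives such as text generation and fill-in-the-blank (masked) prediction. The data can be loaded and preprocessed through the interface provided by Hugging Face and then used for model training. Because the dataset is derived from Common Crawl, it may contain personal and sensitive information and should be processed appropriately before use.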
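
A minimal sketch of this workflow follows. It assumes the Hugging Face `datasets` library is installed, that the repository id is `statmt/cc100` (taken from the download link), and that configuration names match the language codes listed earlier (for example `am`). The helper name `preview_cc100` and its `loader` parameter are hypothetical, and depending on the installed `datasets` version, extra arguments may be needed to load this dataset. Streaming is used so that a configuration does not have to be downloaded in full.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional


def preview_cc100(
    lang: str = "am",
    limit: int = 5,
    loader: Optional[Callable[..., Iterable[Dict[str, Any]]]] = None,
) -> List[Dict[str, Any]]:
    """Return the first `limit` examples of one CC-100 language configuration."""
    if loader is None:
        # Imported lazily so this module can be imported without the dependency.
        from datasets import load_dataset

        loader = load_dataset
    # streaming=True iterates over the data instead of downloading it all first.
    stream = loader("statmt/cc100", lang, split="train", streaming=True)
    return list(islice(stream, limit))
```

Calling `preview_cc100("am", limit=3)` would fetch three examples over the network. The optional `loader` argument exists only so that the function can be exercised offline with a stand-in.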

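Background and Challenges

Background Overview

CC-100 was prepared by the Statistical Machine Translation group at the University of Edinburgh using Facebook Research's CC-Net toolkit, with the aim of recreating the dataset used to train XLM-R. It contains text in more than 100 languages, spanning languages from Africa to Asia, and is meant to provide a large corpus for pre-training multilingual language models. It was built from the January to December 2018 Common Crawl snapshots, using the URLs and paragraph indices provided by the CC-Net repository. Its scale and quality make it an important resource for multilingual natural language processing.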

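Current Challenges

Although CC-100 is a valuable resource for multilingual research, using it poses some challenges. First, because the data comes from Common Crawl, it may contain personal and sensitive information, so researchers must handle it with care when training models. Second, text quality is uneven; filtering out low-quality text and selecting high-quality text was a major challenge in building the corpus. Finally, annotation and quality control for multilingual data are complex, because they must account for the differences and characteristics of individual languages.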

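Common Scenarios

Classic Use Cases

CC-100 is widely used for pre-training in natural language processing. Typical uses include training cross-lingual models, evaluating language models, and multilingual information retrieval. Because it provides text in many languages, researchers can train models in a multilingual setting and build general-purpose models that handle tasks across languages.

Derived Related Work

CC-100 has given rise to a range of related work, including cross-lingual representation learning, multilingual text classification, and models for low-resource languages. This work has advanced multilingual natural language processing and provided useful resources for the research community.

Recent Research on the Dataset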