Colossal Clean Crawled Corpus (C4)

Name: Colossal Clean Crawled Corpus (C4)
Creator: Allen Institute for AI
Published: 2021-10-01 01:20:01
License: No description available

Source: arXiv; updated 2021-10-01; indexed 2024-06-21

Download link:

https://github.com/allenai/c4

Resource overview:

Colossal Clean Crawled Corpus (C4) is a large-scale English-language dataset created by the Allen Institute for AI. It contains more than 365 million documents from the internet, totaling more than 156 billion tokens. The dataset was built by applying a series of filtering and cleaning steps to a single snapshot of Common Crawl; these filters are intended to remove text that is not natural English and so improve data quality. C4 has been used to train large pre-trained English language models such as T5 and Switch Transformer. Its applications include improving performance on natural language processing tasks and pre-training language models for language understanding and generation.
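
To give a rough sense of scale, the two figures quoted above can be combined into an average document length. The sketch below is illustrative arithmetic only: both counts are stated as lower bounds ("more than"), so the result is an approximation, and the function and constant names are hypothetical rather than part of any C4 tooling.

```python
# Figures quoted in the overview above; both are lower bounds ("more than").
NUM_DOCUMENTS = 365_000_000       # documents in C4
NUM_TOKENS = 156_000_000_000      # total tokens in C4


def average_tokens_per_document(num_tokens: int, num_documents: int) -> float:
    """Return the mean number of tokens per document."""
    if num_documents <= 0:
        raise ValueError("num_documents must be positive")
    return num_tokens / num_documents


avg_tokens = average_tokens_per_document(NUM_TOKENS, NUM_DOCUMENTS)
print(f"~{avg_tokens:.0f} tokens per document")  # prints "~427 tokens per document"
```

On these figures a typical document is on the order of a few hundred tokens, which is consistent with a corpus made up of individual web pages rather than long-form texts.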

Provider:

Allen Institute for AI

Created:

2021-04-18
