
Colossal Clean Crawled Corpus (C4)

Source: arXiv. Updated: 2021-10-01. Indexed: 2024-06-21.
Download link:
https://github.com/allenai/c4
Overview:
The Colossal Clean Crawled Corpus (C4) is a large-scale English text dataset, documented and distributed in this release by the Allen Institute for AI. It contains more than 365 million documents from the web, totaling more than 156 billion tokens. The corpus was built by applying a series of filters to a single Common Crawl snapshot; the filters are intended to remove text that is not natural English. C4 has been used to train large pretrained English language models such as T5 and Switch Transformer, and it is used more broadly for language model pretraining and for research on natural language understanding and generation.
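To give a feel for what "applying a series of filters" to crawled text can look like, here is a minimal sketch of a page-level cleaning pass. It is not the actual C4 pipeline. The rules (keep lines that end in terminal punctuation, drop very short lines, reject pages with placeholder text or curly braces, reject pages with too few sentences) are simplified heuristics of the kind commonly described for this corpus. The function name `clean_page` and the threshold values are assumptions chosen for illustration.

```python
import re

# Illustrative thresholds (assumptions), not documented values for C4.
MIN_WORDS_PER_LINE = 5
MIN_SENTENCES_PER_PAGE = 3
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Substrings that suggest placeholder text or source code rather than prose.
BLOCKED_SUBSTRINGS = ("lorem ipsum", "{")


def clean_page(text):
    """Return the cleaned page text, or None if the whole page is rejected."""
    lowered = text.lower()
    if any(marker in lowered for marker in BLOCKED_SUBSTRINGS):
        return None

    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like complete sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        # Drop short fragments such as menu items or button labels.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept_lines.append(line)

    cleaned = "\n".join(kept_lines)
    # Crude sentence count: terminal punctuation followed by whitespace,
    # a closing quote, or the end of the text.
    sentence_count = len(re.findall(r'[.!?](?:\s|"|$)', cleaned))
    if sentence_count < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned


if __name__ == "__main__":
    page = (
        "Home | About | Contact\n"
        "The corpus was built from a web crawl snapshot.\n"
        "Short line.\n"
        "Filters remove text that does not look like natural language!\n"
        "Does this sentence have enough words to pass?\n"
        "Click here"
    )
    print(clean_page(page))
```

Navigation fragments and short lines are dropped line by line, while placeholder text, braces, or too few sentences cause the whole page to be rejected. A real pipeline would add further steps such as language identification and deduplication.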
Provider:
Allen Institute for AI
Created:
2021-04-18