blue-blues/c2
Source: Hugging Face. Updated: 2025-11-02. Indexed: 2025-11-03.
Download link:
https://hf-mirror.com/datasets/blue-blues/c2
Description:
The Common Crawl WET Dataset - c2 is a large-scale filtered dataset derived from the WET files of the Common Crawl project. It has been cleaned and aggregated for natural language processing tasks, especially pretraining large language models (LLMs). The data comes from the September 2025 crawl, CC-MAIN-2025-38. It consists of plaintext extracted from web-crawl WET files and aggressively filtered to remove metadata and boilerplate. The text is packed into large combined files of approximately 15 GB each to balance upload size against storage constraints. Preprocessing involves streamed extraction, metadata removal, and filtering of boilerplate and duplicate content. The dataset is primarily intended for pretraining foundation models and LLMs that require diverse, massive-scale natural language corpora.
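The preprocessing steps named in the description (streamed extraction, metadata removal, boilerplate and duplicate filtering) could look roughly like the sketch below. This is an illustration only, not the dataset author's pipeline: the function names, the `min_words` threshold, the short-line boilerplate heuristic, and exact-hash deduplication are all assumptions, and record boundaries are detected by the `WARC/1.` version line rather than by `Content-Length`, which a production parser would use.

```python
import hashlib
from typing import Iterable, Iterator, List, Set


def iter_wet_documents(lines: Iterable[str]) -> Iterator[str]:
    """Stream over WET lines and yield the text body of each 'conversion' record.

    Simplification (assumption): a record starts at a line beginning with
    'WARC/1.' and its headers end at the first blank line.
    """
    body: List[str] = []
    in_headers = False
    keep = False  # True only for records whose WARC-Type is 'conversion'
    for raw in lines:
        line = raw.strip()
        if line.startswith("WARC/1."):
            # New record: emit the previous body, then start reading headers.
            if body:
                yield "\n".join(body)
            body, in_headers, keep = [], True, False
        elif in_headers:
            if line == "":
                in_headers = False  # blank line closes the header block
            elif line.lower().startswith("warc-type:"):
                keep = line.split(":", 1)[1].strip() == "conversion"
        elif keep and line:
            body.append(line)  # metadata headers never reach this branch
    if body:
        yield "\n".join(body)


def clean_documents(docs: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Drop short lines (hypothetical boilerplate heuristic) and exact duplicates."""
    seen: Set[str] = set()
    for doc in docs:
        kept = [ln for ln in doc.split("\n") if len(ln.split()) >= min_words]
        text = "\n".join(kept)
        if not text:
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same filtered text was already emitted
        seen.add(digest)
        yield text
```

Because both functions are generators, a file opened with `gzip.open(path, "rt", encoding="utf-8", errors="replace")` can be passed straight to `iter_wet_documents` and processed without loading it into memory; `path` here stands for any local `.warc.wet.gz` file.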
Provider:
blue-blues



