Ultra-FineWeb

Name: Ultra-FineWeb
Creator: maas
Published: 2026-05-24 02:23:18
License: No description available

ModelScope community: updated 2026-05-24, indexed 2025-05-17

Download link:

https://modelscope.cn/datasets/OpenBMB/Ultra-FineWeb

Resource description:

# Ultra-FineWeb

<div align="center">
  <img src="assets/ultra-fineweb-logo.png" width="600"/>
</div>

<div align="center">

[📜 Ultra-FineWeb Technical Report](https://arxiv.org/abs/2505.05427) | [📄 MiniCPM4 Paper](https://huggingface.co/papers/2506.07900) | [💻 GitHub Repository](https://github.com/openbmb/minicpm) | [🌐 MiniCPM4 Project Page](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b)

</div>

## 📚 Introduction

Ultra-FineWeb is a **large-scale, high-quality, and efficiently-filtered dataset**. We apply the proposed efficient, verification-based, high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3). The result is the higher-quality Ultra-FineWeb-en dataset, with approximately 1T tokens, and the Ultra-FineWeb-zh dataset, with approximately 120B tokens, collectively referred to as Ultra-FineWeb. ***Ultra-FineWeb*** serves as a core pre-training web dataset for the [MiniCPM4 Series](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b) models.

- [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb): Ultra-FineWeb, a **large-scale, high-quality, and efficiently-filtered dataset**, with 1T English tokens and 120B Chinese tokens. (**<-- you are here**)
- [Ultra-FineWeb-classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier): Ultra-FineWeb classifier, for filtering high-quality data from web corpora.

## 📢 What's New

- **[2025.05.09]** The **Ultra-FineWeb** technical report is available on [arXiv](https://arxiv.org/abs/2505.05427). 🔥🔥🔥
- **[2025.05.15]** **Ultra-FineWeb** tops the Hugging Face Datasets Trending list, reaching the #1 spot! ⭐️⭐️⭐️
- **[2025.06.06]** The **Ultra-FineWeb-en** and **Ultra-FineWeb-zh** datasets are now available on Hugging Face, released alongside the [MiniCPM4 Series](https://huggingface.co/collections/openbmb/minicpm-4-6841ab29d180257e940baa9b) models.
- **[2025.06.16]** The **Ultra-FineWeb-classifier** is now available on Hugging Face: [openbmb/Ultra-FineWeb-classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier).
- **[2025.12.10]** [***Ultra-FineWeb-en-v1.4***](https://huggingface.co/datasets/openbmb/Ultra-FineWeb/tree/main/data/ultrafineweb_en_v1_4) is released! 2.2T tokens fully open-sourced! Built on [FineWeb-v1.4](https://huggingface.co/datasets/HuggingFaceFW/fineweb), incorporating CommonCrawl snapshots from Apr 2024 to Jun 2025 to capture the latest world knowledge. 🚀🚀🚀

## 💡 Highlights

> **Abstract:** Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on *fastText*, and successfully apply the filtering pipeline to two widely-used pre-training corpora, the *FineWeb* and *Chinese FineWeb* datasets, resulting in the creation of the higher-quality ***Ultra-FineWeb*** dataset. ***Ultra-FineWeb*** contains approximately 1 trillion (T) English tokens and 120 billion (B) Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

<div align="center">
  <img src="assets/ultra-fineweb-pipeline.png" width="600"/>
</div>

- **Efficient Verification Strategy:** We propose a computationally efficient verification strategy that enables rapid evaluation of the impact of data on LLM training performance with minimal computational cost, significantly improving the efficiency of high-quality data filtering experiments.
- **Large-Scale High-Quality Pre-training Datasets:** We design and implement an efficient high-quality data filtering pipeline, applied to the FineWeb and Chinese FineWeb datasets, resulting in the creation of higher-quality datasets, which can facilitate high-quality LLM training.
- **Lightweight Classifier:** The Ultra-FineWeb classifier significantly reduces inference costs, achieving superior performance on extracted text from the same data source, thus validating the effectiveness of our proposed data filtering pipeline in enhancing data quality and training efficiency.

## 📈 Evaluation Results

We use the MiniCPM-1.2B model architecture with the MiniCPM3-4B tokenizer. Each experiment involves training on 100B tokens, allowing for comprehensive data performance validation within computationally efficient parameters. We use the Lighteval library for model evaluation and adopt 11 benchmarks to evaluate the performance of the trained models. All evaluation metrics are based on a zero-shot setting. The evaluation metrics include:

- **English benchmarks:** MMLU, ARC-C, ARC-E, CommonSenseQA, HellaSwag, OpenbookQA, PIQA, SIQA, and Winogrande.
- **Chinese benchmarks:** C-Eval and CMMLU.

Detailed evaluation results are reported below:

- **Individual Data Experiments.** We perform isolated training runs using single datasets, facilitating direct comparisons between differently processed data from identical sources.

<img src="assets/individual-english-table.png" alt="Individual English Table" width="75%">
<img src="assets/individual-chinese-table.png" alt="Individual Chinese Table" width="75%">
<img src="assets/individual-plot.png" alt="Individual Plot" width="100%">

- **Mixed Data Experiments.** We use a mix of 60% English data, 30% Chinese data, and 10% code data (StarCoder-v2).

<img src="assets/mix-table.png" alt="Mix Table" width="75%">
<img src="assets/mix-plot.png" alt="Mix Plot" width="100%">

- **Loss and Performance Estimation Results.** We use the performance estimation methods proposed in [Densing Law](https://arxiv.org/abs/2412.04315) for further analysis and verification of the effectiveness of Ultra-FineWeb.

<img src="assets/densing-law-table.png" alt="Densing Law Table" width="75%">
<img src="assets/densing-law-plot.png" alt="Densing Law Plot" width="100%">

## ❤️ Acknowledgements

- The ***Ultra-FineWeb classifier*** is built on [fastText](https://fasttext.cc/).
- The ***Ultra-FineWeb-en dataset*** is built on [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
- The ***Ultra-FineWeb-zh dataset*** is built on [IndustryCorpus2](https://huggingface.co/datasets/BAAI/IndustryCorpus2), [MiChao](https://opendatalab.com/OpenDataLab/MiChao), [WuDao](https://data.baai.ac.cn/details/WuDaoCorporaText), [SkyPile](https://huggingface.co/datasets/Skywork/SkyPile-150B), [WanJuan](https://opendatalab.com/OpenDataLab/WanJuanCC), [ChineseWebText](https://huggingface.co/datasets/CASIA-LM/ChineseWebText2.0), [TeleChat](https://huggingface.co/datasets/Tele-AI/TeleChat-PTD), and [CCI3](https://huggingface.co/datasets/BAAI/CCI3-Data).

Thanks for their awesome work! Open-source contributions make Ultra-FineWeb possible! 🙌

## 🌟 Citation

If you find our work useful, please consider citing:

```bibtex
@misc{wang2025ultrafineweb,
  title={{Ultra-FineWeb}: Efficient Data Filtering and Verification for High-Quality LLM Training Data},
  author={Yudong Wang and Zixuan Fu and Jie Cai and Peijun Tang and Hongya Lyu and Yewei Fang and Zhi Zheng and Jie Zhou and Guoyang Zeng and Chaojun Xiao and Xu Han and Zhiyuan Liu},
  year={2025},
  eprint={2505.05427},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}
```

And the main paper where Ultra-FineWeb is used:

```bibtex
@article{minicpm4,
  title={MiniCPM4: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
```

## 💳 License

This project is released under the [Apache 2.0](./LICENSE) license. Please note that since ***Ultra-FineWeb*** is built using multiple datasets, users should check the **LICENSE of each dataset individually** to ensure proper usage and compliance.
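As a rough illustration of the mixed-data setup described in the evaluation results (60% English, 30% Chinese, 10% code, with each experiment trained on 100B tokens), the sketch below works out the per-source token budget. The function name `token_budget` and the dictionary keys are hypothetical, chosen only for this example; the card does not describe how the mixture was actually implemented.

```python
from fractions import Fraction

# Mixture ratios as stated for the mixed-data experiments.
# The key names are illustrative assumptions, not identifiers from the project.
MIX = {
    "english": Fraction(60, 100),
    "chinese": Fraction(30, 100),
    "code": Fraction(10, 100),
}


def token_budget(total_tokens: int, mix: dict[str, Fraction]) -> dict[str, int]:
    """Split a total token budget across sources according to `mix`."""
    if sum(mix.values()) != 1:
        raise ValueError("mixture ratios must sum to 1")
    # Fractions keep the arithmetic exact, so no rounding drift is introduced.
    return {name: int(total_tokens * share) for name, share in mix.items()}


# 100B tokens per experiment, as stated in the evaluation setup.
budget = token_budget(100_000_000_000, MIX)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
```

Under these assumptions, a 100B-token run would draw roughly 60B English, 30B Chinese, and 10B code tokens.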


Provider: maas

Created: 2025-05-15
