enche1561/Fineweb-Edu-Chinese-V2.2

Name: enche1561/Fineweb-Edu-Chinese-V2.2
Creator: enche1561
Published: 2026-03-08 07:43:28
License: 暂无描述

Hugging Face2026-03-08 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/enche1561/Fineweb-Edu-Chinese-V2.2

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - zh license: apache-2.0 task_categories: - text-generation - question-answering tags: - education - nlp - sft - synthetic - deepseek size_categories: - 10B<n<100B - 1M<n<10M pretty_name: Chinese Fineweb Edu V2.2 configs: - config_name: default data_files: - split: sft_qa path: sft/cleaned/*.jsonl - split: sft_context path: sft/*.jsonl - config_name: pretrain data_files: - split: score_4_5 path: 4_5/*.parquet - split: score_3_4 path: 3_4/*.parquet - split: score_2_3 path: 2_3/*.parquet --- # Chinese Fineweb Edu Dataset V2.2 (Instruct & Pre-train) <div align="center"> <a href="#chinese">[[中文]]</a> | <a href="#english">[[English]]</a> </div> <a id="english"></a> <div align="center"> <img width="600px" alt="OpenCSG" src="./logo.png"> [OpenCSG Community](https://opencsg.com/models) | [👾 GitHub](https://github.com/yuyijiong/fineweb-edu-chinese) | [📖 Technical Report](https://arxiv.org/abs/2501.08197) </div> ## Dataset Introduction: Filling the Data Puzzle for Chinese Education LLMs **Chinese Fineweb Edu Dataset V2.2** is a rare high-quality dataset in the open-source community that covers the full process from **Pre-training** to **Supervised Fine-Tuning (SFT)** for the Chinese education domain. This project aims to solve the core pain point of "scarcity of high-quality educational corpora" in the Chinese open-source community. Building on the massive pre-training data of V2.1, the V2.2 version leverages the powerful text understanding capabilities of **DeepSeek V3.2** to distill 1.43 million high-quality Q&A pairs from the top 0.1% of high-quality corpora, providing a standardized **Post-training** dataset for the community. --- ## Why Do We Need This Dataset? In current LLM research and development, the **"scarcity of high-quality post-training data"** has become the biggest bottleneck restricting the leap in model intelligence. ### 1. The Trap of "Models Taking Shortcuts" Current open-source SFT data (such as early Alpaca, ShareGPT) allows models to learn dialogue formats but often sacrifices factual accuracy. > **Core Arguments & Evidence:** > * **LIMA Hypothesis (Less Is More for Alignment)**: Meta AI research shows that the primary role of fine-tuning is "format alignment" rather than "learning new knowledge." Just 1,000 carefully selected high-quality samples can outperform 50,000 ordinary samples. This proves that **data purity is far more important than quantity**. > * Reference: [LIMA: Less Is More for Alignment (NeurIPS 2024)](https://arxiv.org/abs/2305.11206) > > * **The False Promise of Imitation Learning**: Research from UC Berkeley points out that training models with large amounts of low-quality SFT data only allows them to "imitate the style of proprietary models" without acquiring their logical reasoning capabilities. This results in models that are **"giants in style, but dwarfs in fact."** > * Reference: [The False Promise of Imitation Learning](https://arxiv.org/abs/2305.15717) ### 2. The Quality Crisis of Synthetic Data As more data is generated by AI, models will degrade if strict quality control is lacking. > **Core Arguments & Evidence:** > * **Model Collapse**: Rice University research found that if models are trained recursively on low-quality synthetic data, "Model Collapse" occurs, losing tail information of the distribution and leading to a loss of creativity and diversity. **The only way to avoid collapse is to use highly pure, textbook-quality synthetic sources.** > * Reference: [Self-Consuming Generative Models Go MAD](https://arxiv.org/abs/2307.01850) > > * **Lessons from AlpaGasus**: Researchers filtered out 90% of low-quality Alpaca data and trained a model with only 9,000 samples, which outperformed the model trained on the full dataset in various metrics. > * Reference: [AlpaGasus: Training A Better Alpaca with Fewer Data](https://arxiv.org/abs/2307.08701) ### Strategy of V2.2 Addressing the above industry pain points, V2.2 insists on **Quality Over Quantity**: 1. **Reject Low Quality**: Only the **Top 0.1%** of corpora with the highest scores are selected as seeds to avoid model collapse from the source. 2. **Reject Hallucinations**: Utilizing DeepSeek V3.2's powerful reading comprehension capabilities, Q&A pairs are generated strictly based on the provided `Context`. Unlike freely generated chat data, **every entry in our data has solid evidence in the original text.** --- ## Version Evolution & Comparison | Version | Positioning | Scale | Key Features & Improvements | Status | | :--- | :--- | :--- | :--- | :--- | | **V1.0** | **Proof of Concept** | ~90M Entries (300GB) | • Initial BERT scoring model • MinHash deduplication • Sources: CCI2, SkyPile, Tele-AI | 🔴 Deprecated | | **V2.0** | **Scale Up** | ~188M Entries (420B Tokens) | • **Upgraded Scorer**: OpenCSG csg-wukong-enterprise V2 • **Expanded Sources**: Industry2, wanjuan1.0, wudao | 🔴 Deprecated | | **V2.1** | **Pre-train Selection** | ~1.5T Tokens (Total) | • **Quality Stratification**: Archived by score (4-5, 3-4) • **New Sources**: map-cc, opencsg-cc • **Supports Curriculum Learning** | 🟢 **Recommended** (For Pre-train) | | **V2.2** | **SFT & Alignment** | **1.43M Pairs** High-Quality QA | • **DeepSeek V3.2 Synthesis**: Generated based on top-tier corpora • **Full Process**: Provides provenance source & pure fine-tuning versions • **Fact Anchoring**: Strictly faithful to original text | 🟢 **Recommended** (For SFT) | --- ## V2.2 SFT Dataset Details: Reshaping Post-Training This section is designed for **Instruction Tuning** scenarios. ### 1. SFT Data Specifications | Component | Volume | Disk Usage | Core Value & Application | | :--- | :--- | :--- | :--- | | **Pure QA SFT** | **1.43M Lines** | **3.4 GB** | **[Best for Fine-Tuning]** Contains only `Instruction` and `Output`. • **Usage**: Directly used for SFT in frameworks like LLaMA-Factory. • **Feature**: Lightweight format, focusing on allowing the model to learn high-quality answer paradigms distilled by DeepSeek. | | **Full Context SFT** | **1.43M Lines** | **14.6 GB** | **[Provenance & Backup]** Contains `Original Seed Text` + `Instruction` + `Output`. • **⚠️ Note**: **Not recommended** for direct SFT training (Tokens are too long and contain non-dialogue content). • **Usage**: Used only for data provenance. When the model behaves abnormally, you can trace back to this file to check if the original knowledge point is accurate. | ### 2. Sample & Analysis ```json { "instruction": "Who is the Swiss archaeologist Ferdinand Keller? What was his most famous archaeological discovery, and how does this help us understand ancient history?", "input": "", "output": "Ferdinand Keller (1800-1881) was an important Swiss archaeologist... His most famous discovery was the **Lake Dwellings** (Pfahlbauten) found in Switzerland in 1853...\n\nThe help of this discovery in understanding ancient history is mainly reflected in:\n1. **Lifestyle Reconstruction**: It revealed the dwelling forms of prehistoric humans in the Alpine region around 5000-500 BC.\n2. **Cultural Remains**: Due to the anoxic underwater environment, a large amount of organic matter (such as wood, food residues) was preserved, providing direct evidence for studying the technological level at that time..." } ``` > **Data Quality Analysis**: > When handling such tasks, DeepSeek V3.2 demonstrates strong information extraction and summarization capabilities. The generated answers contain not only Facts but also structured Explanations, which are not found in ordinary short-text SFT data. --- ## V2.1 Pre-train Dataset Details *(If you focus on base model pre-training, V2.1 remains one of the most granularly stratified Chinese education corpora.)* We recommend adopting a **Curriculum Learning** strategy, using data from different tiers at different training stages: * **Tier 1: Excellent Quality (Score 4-5) - [70 GB]** * **Positioning**: The model's "Textbook". * **Usage**: Recommended for the final stage of pre-training—**The Annealing Phase**. * **Technical Background**: According to technical reports from DeepSeek, Llama 3, etc., high-intensity training with high-quality, low-noise data at the end of training can significantly reduce model PPL and greatly improve instruction following ability. * **Tier 2: High Quality Content (Score 3-4) - [800 GB]** * **Positioning**: The model's "Supplementary Reading". * **Usage**: Main force data for the mid-stage of pre-training, building a broad worldview. * **Tier 3: Supplementary Corpora (Score 2-3) - [1.4 TB]** * **Positioning**: The model's "Social Knowledge". * **Usage**: Improving the model's linguistic robustness and ability to withstand noise. --- ## Quick Start Load directly using the Hugging Face `datasets` library: ```python from datasets import load_dataset # ------------------------------------------------------- # Scenario A: SFT Instruction Tuning # ------------------------------------------------------- # Load pure QA pairs (3.4GB), format is standard instruction/output ds_sft = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_qa") # ------------------------------------------------------- # Scenario B: Data Provenance & Backup # ------------------------------------------------------- # If you need to check which original article a QA pair was generated from, load sft_context # Note: Only for backup and reference, not recommended for direct training ds_context = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_context") # ------------------------------------------------------- # Scenario C: Base Model Pre-training # ------------------------------------------------------- # Load Score 4-5 high-quality pre-training corpus (Parquet format) ds_pretrain = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", data_files="pretrain/score_4_5/*.parquet") ``` --- ## License Agreement & Citation **License**: OpenCSG Community License. The Chinese Fineweb Edu dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions outlined in the OpenCSG Community License as well as the Apache 2.0 License. For commercial use, please send an email to `lorraineg@opencsg.com` and obtain permission. --- <a id="chinese"></a> <div align="center"> <img width="600px" alt="OpenCSG" src="./logo.png"> [OpenCSG 社区](https://opencsg.com/models) | [👾 GitHub](https://github.com/yuyijiong/fineweb-edu-chinese) | [📖 技术报告](https://arxiv.org/abs/2501.08197) </div> ## 数据集简介：填补中文教育大模型的数据拼图 **Chinese Fineweb Edu Dataset V2.2** 是目前开源界少有的、覆盖从 **预训练 (Pre-train)** 到 **微调 (SFT)** 全流程的高质量中文教育数据集。本项目旨在解决中文开源社区中“高质量教育语料稀缺”的核心痛点。V2.2 版本在 V2.1 海量预训练数据的基础上，利用 **DeepSeek V3.2** 强大的文本理解能力，从全网最优质的 0.1% 语料中蒸馏出 143 万条高质量问答对，为社区提供了一套标准化的**“后训练（Post-training）”**数据集。 --- ## 为什么我们需要这套数据？在当前的 LLM 研发中，**“高质量后训练数据的稀缺”**已成为制约模型智力跃升的最大瓶颈。 ### 1. "模型喜欢走捷径"的困局目前的开源 SFT 数据（如早期的 Alpaca, ShareGPT）虽然让模型学会了对话的格式，但往往牺牲了事实准确性。 > **核心论据与证明：** > * **LIMA 假设 (Less Is More for Alignment)**：Meta AI 的研究表明，微调的主要作用是“格式对齐”而非“学习新知识”。仅仅 1000 条精心挑选的高质量数据，其效果就能击败 50000 条普通数据。这证明了**数据的纯度远比数量重要**。 > * 参考论文：[LIMA: Less Is More for Alignment (NeurIPS 2024)](https://arxiv.org/abs/2305.11206) > > > * **模仿学习的虚假承诺**：UC Berkeley 研究指出，使用大量低质 SFT 数据训练模型，只能让模型学会“模仿专有模型的语气”，而无法习得其逻辑推理能力。这导致模型变成**“风格上的巨人，事实上的矮子”**。 > * 参考论文：[The False Promise of Imitation Learning](https://arxiv.org/abs/2305.15717) > > > > ### 2. 合成数据的质量危机随着越来越多的数据由 AI 生成，如果缺乏严格的质量控制，模型会出现退化。 > **核心论据与证明：** > * **模型坍塌**：Rice University 研究发现，如果在低质量的合成数据上循环训练，模型会发生“坍塌”，丢失分布的尾部信息，导致创造力和多样性丧失。**避免坍塌的唯一方法是使用高度纯净、接近教科书质量的合成源。** > * 参考论文：[Self-Consuming Generative Models Go MAD](https://arxiv.org/abs/2307.01850) > > > * **AlpaGasus 的启示**：研究者通过过滤掉 90% 的低质 Alpaca 数据，仅用 9000 条数据训练的模型，在各项指标上反而超越了全量数据训练的模型。 > * 参考论文：[AlpaGasus: Training A Better Alpaca with Fewer Data](https://arxiv.org/abs/2307.08701) > > > > ### V2.2 的应对策略针对上述行业痛点，V2.2 坚持 **质量至上**： 1. **拒绝低质**：只取全网评分最高的 **Top 0.1%** 语料作为种子，从源头避免模型坍塌。 2. **拒绝幻觉**：利用 DeepSeek V3.2 强大的阅读理解能力，严格基于 `Context` 生成问答。不同于自由生成的聊天数据，我们的数据**每一条都有确凿的原文依据**。 --- ## 版本演进与特性对比 | 版本号 | 核心定位 | 数据规模 | 关键特性与改进 | 当前状态 | | --- | --- | --- | --- | --- | | **V1.0** | **概念验证** | ~90M 条目 (300GB) | • 初代 BERT 打分模型 • 引入 MinHash 去重 • 数据源：CCI2, SkyPile, Tele-AI | 🔴 已弃用 | | **V2.0** | **规模化扩展** | ~188M 条目 (420B Tokens) | • **升级打分器**：OpenCSG csg-wukong-enterprise V2 • **扩展数据源**：Industry2, wanjuan1.0, wudao | 🔴 已弃用 | | **V2.1** | **预训练精选** | ~1.5T Tokens (总计) | • **质量分层**：按分数归档 (4-5分, 3-4分) • **新增源**：map-cc, opencsg-cc • **支持课程学习** | 🟢 **推荐** (预训练用) | | **V2.2** | **SFT与对齐** | **143.7万对** 高质量问答 | • **DeepSeek V3.2 合成**：基于顶级语料生成 • **全流程覆盖**：提供数据回溯源与纯净微调版 • **事实锚定**：回答内容严格忠实于原文 | 🟢 **推荐** (SFT用) | --- ## V2.2 SFT 数据集详情：重塑后训练流程本部分数据专为 **指令微调** 场景打造。 ### 1. SFT 数据规格表 | 数据组件 | 数据量 | 磁盘占用 | 核心价值与应用场景 | | --- | --- | --- | --- | | **Pure QA SFT** | **143.7 万条** | **3.4 GB** | **[微调首选]** 仅包含 `Instruction` 和 `Output`。 • **用途**：直接用于 LLaMA-Factory 等框架进行 SFT。 • **特点**：格式轻量，专注于让模型学习由 DeepSeek 提炼出的高质量回答范式。 | | **Full Context SFT** | **143.7 万条** | **14.6 GB** | **[数据回溯与备份]** 包含了 `原始种子文本` + `Instruction` + `Output`。 • **⚠️ 注意**：**不建议**直接用于 SFT 训练（Token 长度过长且包含非对话内容）。 • **用途**：仅作为数据溯源使用。当模型表现异常时，可回溯至该文件检查原始知识点是否准确。 | ### 2. 数据样例与分析 ```json { "instruction": "瑞士考古学家费迪南德·克勒是谁? 他最著名的考古发现是什么, 这对我们理解古代历史有什么帮助? ", "input": "", "output": "费迪南德·克勒 (Ferdinand Keller, 1800-1881) 是一位瑞士重要考古学家... 他最著名的考古发现是1853年在瑞士发现的**湖居遗址**（也称“干栏建筑”）...\n\n这一发现对理解古代历史的帮助主要体现在：\n1. **生活方式重建**：揭示了公元前5000-500年左右阿尔卑斯地区史前人类的居住形态。\n2. **文化遗存**：由于水下环境缺氧，大量有机物（如木材、食物残渣）得以保存，为研究当时的技术水平提供了直接证据..." } ``` > **数据质量分析**： > DeepSeek V3.2 在处理此类任务时，展现了极强的信息提取与总结能力。生成的回答不仅包含了事实，还包含了结构化的解释，这是普通短文本 SFT 数据所不具备的。 --- ## V2.1 Pre-train 数据集详情 (预训练篇) *(如果您专注于基座模型的预训练，V2.1 依然是目前质量分层最精细的中文教育语料之一)* 我们建议采用 **课程学习** 策略，在训练的不同阶段使用不同分层的数据： * **Tier 1: 卓越质量 (Score 4-5) - [70 GB]** * **定位**：模型的“教科书”。 * **用途**：建议在预训练的最后阶段——**退火阶段** 使用。 * **技术背景**：根据 DeepSeek、Llama 3 等技术报告，在训练末期使用高质量、低噪声的数据进行高强度训练，能显著降低模型的 PPL 并大幅提升指令遵循能力。 * **Tier 2: 优质内容 (Score 3-4) - [800 GB]** * **定位**：模型的“课外书”。 * **用途**：预训练中期的主力数据，构建广泛的世界观。 * **Tier 3: 补充语料 (Score 2-3) - [1.4 TB]** * **定位**：模型的“社会见闻”。 * **用途**：提升模型的语言鲁棒性和对抗噪声的能力。 --- ## 快速开始使用 Hugging Face `datasets` 库即可一键加载： ```python from datasets import load_dataset # ------------------------------------------------------- # 场景 A: SFT 指令微调 # ------------------------------------------------------- # 加载纯问答对数据 (3.4GB)，格式为标准 instruction/output ds_sft = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_qa") # ------------------------------------------------------- # 场景 B: 数据校验与回溯 # ------------------------------------------------------- # 如果需要查看某条问答是基于哪篇原始文章生成的，请加载 sft_context # 注意：仅用于备份和查阅，不建议直接训练 ds_context = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_context") # ------------------------------------------------------- # 场景 C: 基座模型预训练 # ------------------------------------------------------- # 加载 Score 4-5 的高质量预训练语料 (Parquet格式) ds_pretrain = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", data_files="pretrain/score_4_5/*.parquet") ``` --- ## 📜 许可协议与引用 **许可协议**：OpenCSG Community License。本数据集支持商业用途。如果您计划将 OpenCSG 模型或其衍生产品用于商业目的，请遵守 OpenCSG 社区许可及 Apache 2.0 协议条款。如需用于商业产品，请务必发送邮件至 `lorraineg@opencsg.com` 进行报备并获取许可。

提供机构：

enche1561

5,000+

优质数据集

54 个

任务类型

进入经典数据集