Fineweb-Edu-Chinese-V2.2

ModelScope Community | Updated 2026-05-17 | Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/opencsg/Fineweb-Edu-Chinese-V2.2
Resource description:
# Chinese Fineweb Edu Dataset V2.2 (Instruct & Pre-train)

<div align="center">
<img width="600px" alt="OpenCSG" src="./logo.png">

[OpenCSG Community](https://opencsg.com/models) | [👾 GitHub](https://github.com/yuyijiong/fineweb-edu-chinese) | [📖 Technical Report](https://arxiv.org/abs/2501.08197)

</div>

## 📌 Dataset Introduction: Completing the Data Puzzle for Chinese Educational Large Language Models

**Chinese Fineweb Edu Dataset** is a series of high-quality datasets built for training Chinese educational large language models (LLMs).

### 🌟 Industry Background and Pain Points

High-quality Chinese educational corpora remain scarce in current LLM development.

1. **Limitations of general corpora**: Common datasets such as Common Crawl are massive, but they contain large amounts of advertising, fragmented information, and low-quality text. Training an educational model on them directly leads to severe "hallucinations" and logical confusion.
2. **Scarcity of SFT data**: Most open-source supervised fine-tuning (SFT) data focuses on general conversation or coding. **Rigorously screened, pedagogically structured, domain-specific educational question-answering data is severely lacking.** As a result, many models can chat but cannot "impart knowledge and educate" the way a teacher does.

### 🚀 Milestone Significance of V2.2 (January 2026)

To address these pain points, we are releasing the **V2.2 SFT Enhanced Edition**, built on the massive pre-training data of V2.1.

We now provide not only the "fuel" (pre-training data) but also the "engine blueprints" (SFT data). Using the advanced reasoning capabilities of **DeepSeek V3.2**, we transform static, textbook-grade text into dynamic question-answer pairs with multi-turn logic. This marks the evolution of Chinese Fineweb Edu from a single corpus into a complete data solution covering the full workflow from **Pre-train (knowledge injection)** to **Post-train (behavior alignment)**.

---

## 🔥 V2.2 SFT Dataset Details: Reshaping the Post-Training Pipeline

This part of the data is designed for **Instruction Tuning**, **Alignment**, and **Retrieval-Augmented Generation (RAG)** scenarios.

### 1. Why Do We Need High-Quality SFT Data?

Post-training is the stage that gives a large language model its "interaction capabilities" and "domain expertise".

* **Stimulate reasoning abilities**: Pre-training alone gives a model knowledge but not the ability to apply it. High-quality SFT data teaches the model how to derive, generalize, and deduce from context.
* **Reduce hallucination rates**: Accuracy is a red line in educational scenarios. Because our question-answer pairs are generated from the original text (grounding), the model is pushed to learn to "answer based on facts" rather than fabricate.
* **Simulate pedagogical logic**: The data generated by DeepSeek V3.2 goes beyond simple fact-checking. It includes explanatory and guiding elements that simulate a human teacher's chain of thought.

### 2. Core Construction Strategy: The Refinement Path from 1.4 TB to 1.5 Million Entries

High-quality SFT data follows one rule: **Garbage In, Garbage Out (GIGO)**. To ensure DeepSeek generates the best possible question-answer pairs, we applied extremely strict screening to the input "seed data".

#### Step 1: The Scoring Function as a Filter

We reused the high-precision **Educational Reward Model** trained for V2. Rather than matching keywords, this model scores each text from 0 to 5 based on **knowledge density, logical coherence, and academic rigor**.

* **Score < 2**: Usually highly colloquial, low-information content, or advertising and marketing material.
* **Score > 4**: Usually textbook chapters, academic paper abstracts, in-depth popular science articles, or high-quality encyclopedia entries.

#### Step 2: Why 1.5 Million Entries? (Top-Tier Selection)

Instead of sampling randomly from hundreds of millions of entries, we took the **1,500,000 entries with the highest scores** as seed data.

* **What do the top 1.5 million entries represent?** We used only the top of the pyramid, approximately **0.1% - 0.5%** of the entire corpus.
* **Profile of the seed data**: These 1.5 million entries consist mainly of:
    * **STEM fields**: Derivations from mathematical axioms, detailed explanations of physics and chemistry principles, descriptions of biological evolutionary processes.
    * **Humanities and social sciences**: In-depth analyses of historical events, appreciation of literary works, philosophical and logical argument.
    * **Encyclopedic knowledge**: Clearly defined, structurally complete explanations of concepts.
    * **Educational materials**: Lesson plans, knowledge-point summaries, analyses of difficult points.

#### Step 3: Logical Reconstruction by DeepSeek V3.2

We fed these 1.5 million "gold seed" entries into DeepSeek V3.2 and used its strong instruction-following ability to perform the following tasks:

1. **Understand the context**: Read the original text in depth.
2. **Extract core exam points**: Identify the knowledge points in the text that can be tested through questions.
3. **Build teaching dialogues**: Generate questions and answers. DeepSeek is specifically prompted to generate **detailed answers with logical steps** rather than short Yes/No responses.

### 3. Data Specification Table

| Data Component | Entries | Disk Size | Core Value |
| --- | --- | --- | --- |
| **Full Context SFT** | **1.437 million** | **14.6 GB** | **[First choice for RAG]** Includes both the `original seed text` and the `DeepSeek-generated question-answer pairs`. It can be used to fine-tune a model to answer questions and also to train it to distinguish "relevant context" from "irrelevant noise". |
| **Pure QA SFT** | **1.437 million** | **3.4 GB** | **[First choice for fine-tuning]** Includes only `Instruction` and `Output`. The lengthy background text is removed, leaving a lightweight format suited to fast SFT training with common frameworks such as LLaMA-Factory. |

### 4. Data Showcase

```json
{
  "instruction": "Who is Swiss archaeologist Ferdinand Keller? What is his most famous archaeological discovery, and how does it help us understand ancient history?",
  "input": "",
  "output": "Ferdinand Keller (December 24, 1800 - July 21, 1881) was an important Swiss archaeologist and historian. He initially worked as a teacher at the Zurich Polytechnic School, and later gained academic renown for his archaeological discoveries and research work.\n\nHis most famous archaeological discovery was the **lake-dwelling sites** (also known as \"stilt houses\" or \"lake dwellings\") discovered in Switzerland in 1853. These sites are prehistoric wooden structures built by humans on lakes and swamps, typically erected on wooden piles above the water surface, belonging to cultural heritage of the Neolithic and Bronze Ages.\n\nKeller's research results on lake-dwelling sites were mainly published in *Mitteilungen der Antiquarischen Gesellschaft* (Volumes 9-20, 1854-1879). His work made three important contributions to the field:\n\n1. **Archaeological Value**: For the first time, systematically recorded and studied the structure, function, and living relics of lake-dwelling sites\n2. **Historical Significance**: Revealed the lifestyle of prehistoric humans in the Alpine region around 5000-500 BC\n3. **Methodological Contribution**: Established the standards for Swiss archaeological research\n\n**How it helps understand ancient history**:\n- Lake-dwelling sites preserve organic materials (wood, textiles, food residues), which are rare in dry archaeological sites\n- Revealed prehistoric human building techniques, agricultural practices, and daily life\n- Proved that early European humans had complex social organizations and technical capabilities\n- These discoveries changed the traditional public perception of prehistoric human lifestyles\n\nKeller's other important works include *Architectural Plan of St. Gall Abbey (820 AD)* (1844) and *Archaeological Map of Eastern Switzerland* (1874). He also published numerous studies on the history of Switzerland during the Roman period and early Middle Ages in geographical magazines and archaeological journals."
}
```

---

## 📚 V2.1 Pre-train Dataset Details (Pre-training Section)

*(This section retains the original V2.1 introduction as a reference for pre-training users.)*

If your focus is pre-training a base model, V2.1 remains one of the largest and most finely stratified Chinese educational corpora in the open-source community.

### Data Stratification and Strategic Value

We do not recommend training on everything at once. Instead, we suggest a **Curriculum Learning** strategy chosen according to your compute budget.

* **Tier 1: Excellent Quality (Score 4-5) - [70 GB]**
    * **Strategic role**: This is the model's "textbook". We recommend using it in the **annealing phase** of pre-training or in **early SFT**. Although this tier is small, it can significantly improve the model's logical accuracy and the correctness of its language.
* **Tier 2: High-Quality Content (Score 3-4) - [800 GB]**
    * **Strategic role**: This is the model's "extracurricular reading". It is suitable as the main pre-training data and helps the model build a broad worldview and a body of common sense.
* **Tier 3: Supplementary Corpus (Score 2-3) - [1.4 TB]**
    * **Strategic role**: This is the model's "experience of the wider world". It contains some noise, but it can greatly improve the model's robustness and its adaptability to different language styles.

---

## 🛠️ Quick Start

### Using the Hugging Face `datasets` Library

```python
from datasets import load_dataset

# Scenario A: SFT instruction fine-tuning
# Load the pure question-answer data (3.4 GB)
ds_sft = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_qa")

# Scenario B: RAG retrieval training
# Load the data that includes the original source text (14.6 GB)
ds_rag = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", split="sft_context")

# Scenario C: base model pre-training
# Load the high-quality pre-training corpus with Score 4-5
ds_pretrain = load_dataset("OpenCSG/Chinese-Fineweb-Edu-V2.2", data_files="pretrain/score_4_5/*.parquet")
```

---

## 📜 License Agreement

Use of the **Chinese Fineweb Edu** dataset (including V2.1 and V2.2 SFT) is subject to the **OpenCSG Community License**.

* **Academic research**: Free to use without restriction.
* **Commercial use**: Commercial use is supported, but it must comply with the terms of the OpenCSG Community License and the Apache 2.0 license. If you intend to use the dataset in a commercial product, email `lorraineg@opencsg.com` to register and obtain permission.
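### Illustrative Sketch: Score-Based Seed Selection and Tiering

The card describes two score-driven steps: taking the highest-scoring entries as SFT seeds, and splitting the pre-training corpus into tiers by score range. The following is a minimal sketch of both ideas on toy data, not the authors' pipeline. The `score` field name, the tier labels, the function names, and the boundary handling (each tier includes its lower bound, and Tier 1 also includes 5) are assumptions made for illustration; the card does not specify them.

```python
import heapq
from typing import Dict, Iterable, List, Optional


def tier_for_score(score: float) -> Optional[str]:
    """Map a 0-5 educational score to a tier label (labels are hypothetical).

    Assumption: each tier includes its lower bound, and Tier 1 also
    includes the maximum score of 5. The card does not state this.
    """
    if 4.0 <= score <= 5.0:
        return "tier1_score_4_5"
    if 3.0 <= score < 4.0:
        return "tier2_score_3_4"
    if 2.0 <= score < 3.0:
        return "tier3_score_2_3"
    return None  # below 2 or out of range: no tier is listed for it in the card


def select_top_k(docs: Iterable[Dict], k: int) -> List[Dict]:
    """Return the k highest-scoring documents, best first."""
    return heapq.nlargest(k, docs, key=lambda d: d["score"])


# Toy documents standing in for scored corpus entries.
docs = [
    {"id": "a", "score": 4.7},
    {"id": "b", "score": 1.2},
    {"id": "c", "score": 3.4},
    {"id": "d", "score": 4.1},
    {"id": "e", "score": 2.9},
]

seeds = select_top_k(docs, k=2)
tiers = {d["id"]: tier_for_score(d["score"]) for d in docs}

print([d["id"] for d in seeds])
print(tiers)
```

`heapq.nlargest` avoids sorting the whole collection, which matters when the candidate pool is far larger than the number of seeds kept, as in the top 1.5 million selection described above.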
Provider:
maas
Created:
2026-02-05