five

AM-DeepSeek-R1-0528-Distilled

收藏
魔搭社区2026-01-08 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/AM-DeepSeek-R1-0528-Distilled
下载链接
链接失效反馈
官方服务:
资源简介:
## 📘 Dataset Summary This dataset is a high-quality reasoning corpus **distilled from DeepSeek-R1-0528**, an improved version of the DeepSeek-R1 large language model. Compared to its initial release, DeepSeek-R1-0528 demonstrates significant advances in reasoning, instruction following, and multi-turn dialogue. Motivated by these improvements, we collected and distilled a diverse set of **2.6 million queries** across multiple domains, using DeepSeek-R1-0528 as the teacher. A notable characteristic of DeepSeek-R1-0528 is that its outputs are substantially longer than previous versions, especially in mathematics: for some math problems, the output length is **1.5 to 2 times longer** than earlier generations. This reflects more detailed, explicit step-by-step reasoning. The dataset follows a unified format and verification pipeline, enabling direct comparison with other open-source distillation corpora. It is intended to support the development of next-generation language models with strong, verifiable reasoning abilities. **Performance on this dataset training with [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B).** | Benchmark | [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | AM-DeepSeek-R1-0528-Distilled | |-------------------|--------------------------|---------------------------| | AIME2024 | 91.4 | 87.1 | ## 📂 Dataset Structure ### Data Fields Each sample is a dictionary with the following fields: - `system`: The system prompt used during distillation, typically guiding structured reasoning via `<think>` and `<answer>` tags. - Note: Some instance's 'system' fields in our dataset are empty. The 'system' field is not used in training. Feel free to use them. - `conversations`: A list of dialogue turns structured as: - `from`: Either `'human'` or `'assistant'`. - `value`: Full message content. - `info`: Metadata dictionary containing: - `source`: Dataset origin (e.g., `OpenHermes-2.5`). - `category`: Task domain (e.g., `math`, `code`, `general chat`). - `ground_truth`: Ground truth reference (if applicable). - `test_case`: Associated test case ID (optional). - `instruction_constrain`: Instruction constraint metadata (optional). - `think_content`: Assistant’s reasoning trace. - `answer_content`: Final answer segment. - `verify_score`: Verification confidence score (float ≥ 0.9). - `model_name`: Name of the teacher model (`deepseek-r1-0528`). - `ppl`: Perplexity of the assistant’s output. ## 📈 Dataset Statistics - Shared query base: **2.6 million** unique prompts - Responses distilled from **DeepSeek-R1-0528** - Task Category Breakdown: - **general chat**: 1,223K (47.3%) - **math**: 674K (26.1%) - **code**: 412K (16.0%) - **science**: 220K (8.5%) - **if**: 54K (2.1%) - Each sample is verified and filtered for output quality. ![alt text](AM-distilled.png) > Note that general chat includes both multiturn and other types of data. ## ✅ Verification and Quality Control All outputs underwent **automated verification**, with methods tailored to task categories: - **Math**: Math-Verify (binary pass/fail) - **Code**: Test-case based validation in sandbox environments - **Science**: Answer similarity via LLM scoring - **Instruction Follow**: Verified by `IFEval` validator - **General Chat**: Evaluated using a reward model (e.g., Decision-Tree-Reward-Llama-3.1-8B) Each dataset individually applies: - Perplexity filtering using a strong 32B LLM - N-gram repetition filtering - Structural formatting checks (e.g., presence of `<think>` and `<answer>`) ## ⚠️ Limitations Developers should strictly limit the use of this project’s open-sourced code, data, models, and related artifacts to **research purposes only**. **Commercial use and any applications that could potentially cause harm are strictly prohibited**. The content in this dataset does not reflect the views, beliefs, or endorsements of any individual or institution. The authors disclaim any responsibility for consequences arising from the use, misuse, or interpretation of the dataset and associated materials. ## 📜 Citation If you use this dataset, please cite: ``` @misc{AM-DeepSeek-R1-0528-Distilled, title = {AM-DeepSeek-R1-0528-Distilled}, url = {https://github.com/a-m-team/a-m-models}, author = {a-m-team}, month = {June}, year = {2025} } ```

📘 数据集概述 本数据集是从DeepSeek-R1-0528中蒸馏得到的高质量推理语料库,DeepSeek-R1-0528是DeepSeek-R1大语言模型(Large Language Model)的改进版本。相较于初始发布版本,DeepSeek-R1-0528在推理能力、指令遵循能力及多轮对话方面均实现了显著提升。基于这些改进,我们以DeepSeek-R1-0528作为教师模型,收集并蒸馏得到了覆盖多个领域的**260万条查询样本**。 DeepSeek-R1-0528的一个显著特征是其输出长度较前代版本大幅增加,在数学任务中尤为突出:部分数学问题的输出长度是前代模型的**1.5至2倍**,这体现了更细致、明确的分步推理过程。 本数据集采用统一格式与验证流程,可直接与其他开源蒸馏语料库进行对比,旨在支持具备强可验证推理能力的下一代语言模型研发。 **基于[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B)在本数据集上的训练性能** | 基准测试 | [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | AM-DeepSeek-R1-0528-Distilled | |----------|-----------------------------------------------------------------------|-------------------------------| | AIME2024 | 91.4 | 87.1 | 📂 数据集结构 ### 数据字段 每个样本均为包含以下字段的字典: - `system`:蒸馏过程中使用的系统提示词,通常通过`<think>`和`<answer>`标签引导结构化推理。 - 注意:本数据集中部分样本的`system`字段为空,且训练过程中无需使用该字段,可自由按需使用。 - `conversations`:结构化的对话轮次列表,格式为: - `from`:取值为`'human'`或`'assistant'`,分别代表用户与助手。 - `value`:完整的消息内容。 - `info`:元数据字典,包含以下内容: - `source`:数据集来源(例如`OpenHermes-2.5`)。 - `category`:任务领域(例如`math`、`code`、`general chat`)。 - `ground_truth`:真实参考答案(如适用)。 - `test_case`:关联的测试用例ID(可选)。 - `instruction_constrain`:指令约束元数据(可选)。 - `think_content`:助手的推理轨迹。 - `answer_content`:最终答案片段。 - `verify_score`:验证置信度分数(浮点型,≥0.9)。 - `model_name`:教师模型名称(`deepseek-r1-0528`)。 - `ppl`:助手输出的困惑度(Perplexity)。 📈 数据集统计 - 共享查询库:**260万**条唯一提示词 - 蒸馏得到的回复均来自**DeepSeek-R1-0528** - 任务类别分布: - **通用对话(general chat)**:122.3万条(47.3%) - **数学(math)**:67.4万条(26.1%) - **代码(code)**:41.2万条(16.0%) - **科学(science)**:22.0万条(8.5%) - **if**:5.4万条(2.1%) - 所有样本均经过验证与输出质量过滤。 ![alt text](AM-distilled.png) > 注:通用对话包含多轮对话及其他类型的数据。 ✅ 验证与质量控制 所有输出均经过**自动化验证**,验证方法针对不同任务类别定制: - **数学任务**:使用Math-Verify进行二元分类(通过/不通过) - **代码任务**:在沙箱环境中基于测试用例进行验证 - **科学任务**:通过大语言模型评分计算答案相似度 - **指令遵循任务**:使用`IFEval`验证器进行校验 - **通用对话**:使用奖励模型(例如Decision-Tree-Reward-Llama-3.1-8B)进行评估 本数据集统一应用以下筛选规则: - 使用高性能32B大语言模型进行困惑度过滤 - N-gram重复内容过滤 - 结构格式检查(例如是否包含`<think>`和`<answer>`标签) ⚠️ 局限性 开发者需严格将本项目的开源代码、数据、模型及相关制品**仅用于研究目的**,**严格禁止商业使用及任何可能造成危害的应用**。 本数据集包含的内容不代表任何个人或机构的观点、信念或背书。作者不对因使用、误用或解读本数据集及相关材料所产生的后果承担任何责任。 📜 引用 若使用本数据集,请引用: @misc{AM-DeepSeek-R1-0528-Distilled, title = {AM-DeepSeek-R1-0528-Distilled}, url = {https://github.com/a-m-team/a-m-models}, author = {a-m-team}, month = {June}, year = {2025} }
提供机构:
maas
创建时间:
2025-06-06
搜集汇总
数据集介绍
main_image_url
背景与挑战
背景概述
该数据集是从DeepSeek-R1-0528模型蒸馏得到的高质量推理语料库,包含260万个查询,覆盖通用对话、数学、代码、科学和指令遵循等多个领域。数据集经过严格的自动化验证和质量控制,旨在支持下一代语言模型的推理能力开发,但仅限于研究用途。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作