kabsis/AM-DeepSeek-R1-0528-Distilled
收藏Hugging Face2025-12-19 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/kabsis/AM-DeepSeek-R1-0528-Distilled
下载链接
链接失效反馈官方服务:
资源简介:
---
task_categories:
- text-generation
language:
- en
- zh
tags:
- reasoning
size_categories:
- 1M<n<10M
---
## 📘 Dataset Summary
This dataset is a high-quality reasoning corpus **distilled from DeepSeek-R1-0528**, an improved version of the DeepSeek-R1 large language model. Compared to its initial release, DeepSeek-R1-0528 demonstrates significant advances in reasoning, instruction following, and multi-turn dialogue. Motivated by these improvements, we collected and distilled a diverse set of **2.6 million queries** across multiple domains, using DeepSeek-R1-0528 as the teacher.
A notable characteristic of DeepSeek-R1-0528 is that its outputs are substantially longer than previous versions, especially in mathematics: for some math problems, the output length is **1.5 to 2 times longer** than earlier generations. This reflects more detailed, explicit step-by-step reasoning.
The dataset follows a unified format and verification pipeline, enabling direct comparison with other open-source distillation corpora. It is intended to support the development of next-generation language models with strong, verifiable reasoning abilities.
**Performance on this dataset training with [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B).**
| Benchmark | [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | AM-DeepSeek-R1-0528-Distilled |
|-------------------|--------------------------|---------------------------|
| AIME2024 | 91.4 | 87.1 |
## 📂 Dataset Structure
### Data Fields
Each sample is a dictionary with the following fields:
- `system`: The system prompt used during distillation, typically guiding structured reasoning via `<think>` and `<answer>` tags.
- Note: Some instance's 'system' fields in our dataset are empty. The 'system' field is not used in training. Feel free to use them.
- `conversations`: A list of dialogue turns structured as:
- `from`: Either `'human'` or `'assistant'`.
- `value`: Full message content.
- `info`: Metadata dictionary containing:
- `source`: Dataset origin (e.g., `OpenHermes-2.5`).
- `category`: Task domain (e.g., `math`, `code`, `general chat`).
- `ground_truth`: Ground truth reference (if applicable).
- `test_case`: Associated test case ID (optional).
- `instruction_constrain`: Instruction constraint metadata (optional).
- `think_content`: Assistant’s reasoning trace.
- `answer_content`: Final answer segment.
- `verify_score`: Verification confidence score (float ≥ 0.9).
- `model_name`: Name of the teacher model (`deepseek-r1-0528`).
- `ppl`: Perplexity of the assistant’s output.
## 📈 Dataset Statistics
- Shared query base: **2.6 million** unique prompts
- Responses distilled from **DeepSeek-R1-0528**
- Task Category Breakdown:
- **general chat**: 1,223K (47.3%)
- **math**: 674K (26.1%)
- **code**: 412K (16.0%)
- **science**: 220K (8.5%)
- **if**: 54K (2.1%)
- Each sample is verified and filtered for output quality.

> Note that general chat includes both multiturn and other types of data.
## ✅ Verification and Quality Control
All outputs underwent **automated verification**, with methods tailored to task categories:
- **Math**: Math-Verify (binary pass/fail)
- **Code**: Test-case based validation in sandbox environments
- **Science**: Answer similarity via LLM scoring
- **Instruction Follow**: Verified by `IFEval` validator
- **General Chat**: Evaluated using a reward model (e.g., Decision-Tree-Reward-Llama-3.1-8B)
Each dataset individually applies:
- Perplexity filtering using a strong 32B LLM
- N-gram repetition filtering
- Structural formatting checks (e.g., presence of `<think>` and `<answer>`)
## ⚠️ Limitations
Developers should strictly limit the use of this project’s open-sourced code, data, models, and related artifacts to **research purposes only**. **Commercial use and any applications that could potentially cause harm are strictly prohibited**.
The content in this dataset does not reflect the views, beliefs, or endorsements of any individual or institution. The authors disclaim any responsibility for consequences arising from the use, misuse, or interpretation of the dataset and associated materials.
## 📜 Citation
If you use this dataset, please cite:
```
@misc{AM-DeepSeek-R1-0528-Distilled,
title = {AM-DeepSeek-R1-0528-Distilled},
url = {https://github.com/a-m-team/a-m-models},
author = {a-m-team},
month = {June},
year = {2025}
}
```
task_categories:
- 文本生成
language:
- 英语
- 中文
tags:
- 推理
size_categories:
- 100万<n<1000万
---
## 📘 数据集摘要
本数据集是从DeepSeek-R1-0528(DeepSeek-R1大语言模型(Large Language Model)的改进版本)中蒸馏得到的高质量推理语料库。相较于初始发布版本,DeepSeek-R1-0528在推理能力、指令遵循及多轮对话方面均实现了显著提升。基于这些改进,我们以DeepSeek-R1-0528作为教师模型,收集并蒸馏得到了覆盖多领域的**260万条查询**。
DeepSeek-R1-0528的一个显著特征是其输出长度较前代版本大幅增加,尤其在数学领域:部分数学问题的输出长度是前代模型的**1.5至2倍**,这体现了更细致、明确的分步推理过程。
本数据集采用统一格式与验证流程,可直接与其他开源蒸馏语料库进行对比,旨在支持具备强可验证推理能力的下一代语言模型开发。
**基于本数据集训练[Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B)的性能表现**:
| 评测基准 | [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) | AM-DeepSeek-R1-0528-Distilled |
|-------------------|--------------------------|---------------------------|
| AIME2024 | 91.4 | 87.1 |
## 📂 数据集结构
### 数据字段
每条样本是包含以下字段的字典:
- `system`:蒸馏过程中使用的系统提示词(system prompt),通常通过`<think>`和`<answer>`标签引导结构化推理。
- 注:本数据集中部分样本的`system`字段为空。该字段无需用于训练,可自由使用。
- `conversations`:结构化的对话轮次(dialogue turns)列表,格式为:
- `from`:取值为`'human'`或`'assistant'`。
- `value`:完整的消息内容。
- `info`:包含以下元数据的字典:
- `source`:数据集来源(例如`OpenHermes-2.5`)。
- `category`:任务领域(例如`math`、`code`、`general chat`)。
- `ground_truth`:真实参考答案(如适用)。
- `test_case`:关联的测试用例ID(可选)。
- `instruction_constrain`:指令约束元数据(可选)。
- `think_content`:助手的推理轨迹。
- `answer_content`:最终答案片段。
- `verify_score`:验证置信度得分(浮点型,≥0.9)。
- `model_name`:教师模型名称(`deepseek-r1-0528`)。
- `ppl`:助手输出的困惑度(perplexity)。
## 📈 数据集统计
- 共享查询库:**260万条唯一提示词**
- 响应由**DeepSeek-R1-0528**蒸馏得到
- 任务类别分布:
- **通用闲聊(general chat)**:122.3万条(47.3%)
- **数学(math)**:67.4万条(26.1%)
- **代码(code)**:41.2万条(16.0%)
- **科学(science)**:22.0万条(8.5%)
- **指令遵循(if)**:5.4万条(2.1%)
- 所有样本均经过验证与过滤,以保证输出质量。

> 注:通用闲聊包含多轮对话及其他类型的数据。
## ✅ 验证与质量控制
所有输出均经过**自动化验证**,验证方法针对不同任务类别定制:
- **数学任务**:使用Math-Verify进行二元对错校验
- **代码任务**:在沙箱环境中基于测试用例进行验证
- **科学任务**:通过大语言模型评分计算答案相似度
- **指令遵循任务**:由`IFEval`验证器进行校验
- **通用闲聊任务**:使用奖励模型(例如Decision-Tree-Reward-Llama-3.1-8B)进行评估
每条数据集均单独应用以下过滤规则:
- 使用高性能32B大语言模型进行困惑度过滤
- N-gram重复内容过滤
- 结构格式检查(例如`<think>`和`<answer>`标签的存在性)
## ⚠️ 局限性
开发者应严格将本项目的开源代码、数据、模型及相关制品仅用于**研究目的**。**商业使用及任何可能造成危害的应用均被严格禁止**。
本数据集的内容不代表任何个人或机构的观点、信仰或背书。作者不对因使用、误用或解读本数据集及相关材料所产生的后果承担任何责任。
## 📜 引用
若使用本数据集,请引用:
@misc{AM-DeepSeek-R1-0528-Distilled,
title = {AM-DeepSeek-R1-0528-Distilled},
url = {https://github.com/a-m-team/a-m-models},
author = {a-m-team},
month = {June},
year = {2025}
}
提供机构:
kabsis



