AM-Qwen3-Distilled
收藏魔搭社区2026-01-06 更新2025-05-24 收录
下载链接:
https://modelscope.cn/datasets/a-m-team/AM-Qwen3-Distilled
下载链接
链接失效反馈官方服务:
资源简介:
## 📘 Dataset Summary
AM-Thinking-v1 and Qwen3-235B-A22B are two reasoning datasets distilled from state-of-the-art teacher models. Each dataset contains high-quality, automatically verified responses generated from a shared set of **1.89 million queries** spanning a wide range of reasoning domains.
The datasets share the same format and verification pipeline, allowing for direct comparison and seamless integration into downstream tasks. They are intended to support the development of open-source language models with strong reasoning abilities. Benchmark results show their effectiveness on AIME2024, AIME2025, MATH500, and LiveCodeBench.
For the AM-Thinking-v1-Distilled dataset, see: https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled for more details.
## 📊 Benchmark Performance

| Benchmark | [AM-Thinking-v1 Distilled](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) | [Qwen3-235B-A22B Distilled](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled) | DeepSeek-R1 Distilled|[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | [AM-Thinking-v1](https://huggingface.co/a-m-team/AM-Thinking-v1) | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
|-------------------|--------------------------|---------------------------|---------------------------|--------------------------|--------------------------|---------------------------|---------------------------|
| AIME2024 | **84.3** | 79.4 | 70.9 | 81.4| 85.3 | 85.7 | 79.8 |
| AIME2025 | **72.2** | 62.2 |52.8 | 72.9| 74.4 | 81.5 | 70.0|
| MATH500 | **98.4** | 93.9 |95.8 | - |-| - |-|
| LiveCodeBench | **65.9** | 59.6 | 57.0| 65.7| 70.3 | 70.7 | 64.3|
These results reflect models trained separately using each dataset, demonstrating the impact of teacher model quality on downstream reasoning capabilities.
## 📂 Dataset Structure
### Data Fields
Each sample is a dictionary with the following fields:
- `system`: The system prompt used during distillation, typically guiding structured reasoning via `<think>` and `<answer>` tags.
Note: Some instance's 'system' fields in our dataset are empty. 'system' field is not used in training. Feel free to use them.
- `conversations`: A list of dialogue turns structured as:
- `from`: Either `'human'` or `'assistant'`.
- `value`: Full message content.
- `info`: Metadata dictionary containing:
- `source`: Dataset origin (e.g., `OpenHermes-2.5`).
- `category`: Task domain (e.g., `math`, `code`, `other`).
- `ground_truth`: Ground truth reference (if applicable).
- `test_case`: Associated test case ID (optional).
- `instruction_constrain`: Instruction constraint metadata (optional).
- `think_content`: Assistant’s reasoning trace.
- `answer_content`: Final answer segment.
- `verify_score`: Verification confidence score (float ≥ 0.9).
- `model_name`: Name of the teacher model (`am_thinking_v1` or `qwen3_235b_a22b`).
- `ppl`: Perplexity of the assistant’s output.
## 📈 Dataset Statistics
- Shared query base: **1.89 million** unique prompts
- Each dataset contains responses distilled from one of the following models:
- **AM-Thinking-v1**
- **Qwen3-235B-A22B**
- Task Category Breakdown:
- General Chat: ~41.8%
- Mathematical Reasoning: ~29.5%
- Code Generation: ~17.1%
- Others (Science, Instruction Following, Dialogue): ~11.6%
The general chat includes both multi-turn conversations and other types of data.

## ✅ Verification and Quality Control
All outputs underwent **automated verification**, with methods tailored to task categories:
- **Math**: Math-Verify (binary pass/fail)
- **Code**: Test-case based validation in sandbox environments
- **Science**: Answer similarity via LLM scoring
- **Instruction Follow**: Verified by `IFEval` validator
- **General Chat**: Evaluated using a reward model (e.g., Decision-Tree-Reward-Llama-3.1-8B)
Each dataset individually applies:
- Perplexity filtering using a strong 32B LLM
- N-gram repetition filtering
- Structural formatting checks (e.g., presence of `<think>` and `<answer>`)
## ⚠️ Limitations
Developers should strictly limit the use of this project’s open-sourced code, data, models, and related artifacts to **research purposes only**. **Commercial use and any applications that could potentially cause harm are strictly prohibited**.
The content in this dataset does not reflect the views, beliefs, or endorsements of any individual or institution. The authors disclaim any responsibility for consequences arising from the use, misuse, or interpretation of the dataset and associated materials.
## 📜 Citation
If you use either dataset, please cite:
```
@misc{tian2025correctanswersequaldistillation,
title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters},
author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li},
year={2025},
eprint={2505.14464},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14464},
}
```
📘 数据集概述
AM-Thinking-v1与Qwen3-235B-A22B是两类从顶尖教师模型中蒸馏得到的推理数据集。两类数据集均基于共通的**189万条查询集**生成,涵盖广泛的推理领域,且包含经自动验证的高质量回复。
两类数据集采用统一的格式与验证流程,支持直接对比分析,且可无缝集成至下游任务中。其研发目标为助力具备强推理能力的开源大语言模型(Large Language Model,LLM)开发。基准测试结果证实,该数据集在AIME2024、AIME2025、MATH500及LiveCodeBench等任务上均展现出有效性。有关AM-Thinking-v1-Distilled数据集的详细信息,请访问:https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled。
📊 基准测试性能

| 基准测试 | [AM-Thinking-v1 蒸馏版](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) | [Qwen3-235B-A22B 蒸馏版](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled) | DeepSeek-R1 蒸馏版 | [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | [AM-Thinking-v1](https://huggingface.co/a-m-team/AM-Thinking-v1) | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
|-------------------|--------------------------|---------------------------|---------------------------|--------------------------|--------------------------|---------------------------|---------------------------|
| AIME2024 | **84.3** | 79.4 | 70.9 | 81.4| 85.3 | 85.7 | 79.8 |
| AIME2025 | **72.2** | 62.2 |52.8 | 72.9| 74.4 | 81.5 | 70.0|
| MATH500 | **98.4** | 93.9 |95.8 | - |-| - |-|
| LiveCodeBench | **65.9** | 59.6 | 57.0| 65.7| 70.3 | 70.7 | 64.3|
上述结果均为使用单类数据集单独训练的模型所得,证实了教师模型质量对下游推理能力的影响。
📂 数据集结构
### 数据字段
每个样本均为包含以下字段的字典:
- `system`: 蒸馏过程中使用的系统提示词,通常通过`<think>`与`<answer>`标签引导结构化推理。
注意:本数据集部分样本的`system`字段为空,且训练过程中未使用该字段,使用者可根据需求自由使用。
- `conversations`: 结构化的对话轮次列表,格式为:
- `from`: 取值为`'human'`(人类)或`'assistant'`(助手)。
- `value`: 完整的消息内容。
- `info`: 包含以下元数据的字典:
- `source`: 数据集来源(例如`OpenHermes-2.5`)。
- `category`: 任务领域(例如`math`(数学)、`code`(代码)、`other`(其他))。
- `ground_truth`: 地面真值参考(若适用)。
- `test_case`: 关联的测试用例ID(可选)。
- `instruction_constrain`: 指令约束元数据(可选)。
- `think_content`: 助手的推理轨迹。
- `answer_content`: 最终答案片段。
- `verify_score`: 验证置信度得分(浮点型数值≥0.9)。
- `model_name`: 教师模型名称(`am_thinking_v1`或`qwen3_235b_a22b`)。
- `ppl`: 助手输出的困惑度(Perplexity)。
📈 数据集统计信息
- 共通查询库:**189万**条唯一提示词
- 每类数据集分别包含从以下模型蒸馏得到的回复:
- **AM-Thinking-v1**
- **Qwen3-235B-A22B**
- 任务类别分布:
- 通用对话:约41.8%
- 数学推理:约29.5%
- 代码生成:约17.1%
- 其他(科学、指令遵循、对话):约11.6%
通用对话包含多轮对话及其他类型的数据。

✅ 验证与质量管控
所有输出均经过**自动化验证**,验证方法针对不同任务类别定制:
- **数学任务**:采用Math-Verify进行二元(通过/不通过)验证
- **代码任务**:在沙箱环境中基于测试用例进行验证
- **科学任务**:通过大语言模型评分计算答案相似度
- **指令遵循任务**:由`IFEval`验证器进行校验
- **通用对话任务**:使用奖励模型(例如Decision-Tree-Reward-Llama-3.1-8B)进行评估
每类数据集均单独应用以下过滤与校验流程:
- 基于高性能32B大语言模型的困惑度过滤
- N-gram重复内容过滤
- 结构化格式检查(例如`<think>`与`<answer>`标签的存在性)
⚠️ 使用限制
开发者需严格将本项目开源代码、数据、模型及相关工件仅用于**研究用途**。**商业使用及任何可能造成危害的应用均被严格禁止**。
本数据集包含的内容不代表任何个人或机构的观点、信仰或背书。作者不对因使用、误用或解读本数据集及相关材料所产生的后果承担任何责任。
📜 引用说明
若使用本数据集,请引用以下文献:
@misc{tian2025correctanswersequaldistillation,
title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters},
author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li},
year={2025},
eprint={2505.14464},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14464},
}
提供机构:
maas
创建时间:
2025-05-21



