five

AM-Thinking-v1-Distilled

收藏
魔搭社区2026-05-09 更新2025-05-24 收录
下载链接:
https://modelscope.cn/datasets/a-m-team/AM-Thinking-v1-Distilled
下载链接
链接失效反馈
官方服务:
资源简介:
## 📘 Dataset Summary AM-Thinking-v1 and Qwen3-235B-A22B are two reasoning datasets distilled from state-of-the-art teacher models. Each dataset contains high-quality, automatically verified responses generated from a shared set of **1.89 million queries** spanning a wide range of reasoning domains. The datasets share the same format and verification pipeline, allowing for direct comparison and seamless integration into downstream tasks. They are intended to support the development of open-source language models with strong reasoning abilities. Benchmark results show their effectiveness on AIME2024, AIME2025, MATH500, and LiveCodeBench. For the AM-Qwen3-Distilled dataset, see: https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled for more details. ## 📊 Benchmark Performance ![alt text](benchmarks.png) | Benchmark | [AM-Thinking-v1 Distilled](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) | [Qwen3-235B-A22B Distilled](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled) | DeepSeek-R1 Distilled|[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | [AM-Thinking-v1](https://huggingface.co/a-m-team/AM-Thinking-v1) | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | |-------------------|--------------------------|---------------------------|---------------------------|--------------------------|--------------------------|---------------------------|---------------------------| | AIME2024 | **84.3** | 79.4 | 70.9 | 81.4| 85.3 | 85.7 | 79.8 | | AIME2025 | **72.2** | 62.2 |52.8 | 72.9| 74.4 | 81.5 | 70.0| | MATH500 | **98.4** | 93.9 |95.8 | - |-| - |-| | LiveCodeBench | **65.9** | 59.6 | 57.0| 65.7| 70.3 | 70.7 | 64.3| These results reflect models trained separately using each dataset, demonstrating the impact of teacher model quality on downstream reasoning capabilities. ## 📂 Dataset Structure ### Data Fields Each sample is a dictionary with the following fields: - `system`: The system prompt used during distillation, typically guiding structured reasoning via `<think>` and `<answer>` tags. Note: Some instance's 'system' fields in our dataset are empty. 'system' field is not used in training. Feel free to use them. - `conversations`: A list of dialogue turns structured as: - `from`: Either `'human'` or `'assistant'`. - `value`: Full message content. - `info`: Metadata dictionary containing: - `source`: Dataset origin (e.g., `OpenHermes-2.5`). - `category`: Task domain (e.g., `math`, `code`, `other`). - `ground_truth`: Ground truth reference (if applicable). - `test_case`: Associated test case ID (optional). - `instruction_constrain`: Instruction constraint metadata (optional). - `think_content`: Assistant’s reasoning trace. - `answer_content`: Final answer segment. - `verify_score`: Verification confidence score (float ≥ 0.9). - `model_name`: Name of the teacher model (`am_thinking_v1` or `qwen3_235b_a22b`). - `ppl`: Perplexity of the assistant’s output. ## 📈 Dataset Statistics - Shared query base: **1.89 million** unique prompts - Each dataset contains responses distilled from one of the following models: - **AM-Thinking-v1** - **Qwen3-235B-A22B** - Task Category Breakdown: - General Chat: ~41.8% - Mathematical Reasoning: ~29.5% - Code Generation: ~17.1% - Others (Science, Instruction Following, Dialogue): ~11.6% The general chat includes both multi-turn conversations and other types of data. ![alt text](AM-distilled.png) ## ✅ Verification and Quality Control All outputs underwent **automated verification**, with methods tailored to task categories: - **Math**: Math-Verify (binary pass/fail) - **Code**: Test-case based validation in sandbox environments - **Science**: Answer similarity via LLM scoring - **Instruction Follow**: Verified by `IFEval` validator - **General Chat**: Evaluated using a reward model (e.g., Decision-Tree-Reward-Llama-3.1-8B) Each dataset individually applies: - Perplexity filtering using a strong 32B LLM - N-gram repetition filtering - Structural formatting checks (e.g., presence of `<think>` and `<answer>`) ## ⚠️ Limitations Developers should strictly limit the use of this project’s open-sourced code, data, models, and related artifacts to **research purposes only**. **Commercial use and any applications that could potentially cause harm are strictly prohibited**. The content in this dataset does not reflect the views, beliefs, or endorsements of any individual or institution. The authors disclaim any responsibility for consequences arising from the use, misuse, or interpretation of the dataset and associated materials. ## 📜 Citation If you use either dataset, please cite: ``` @misc{tian2025correctanswersequaldistillation, title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters}, author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li}, year={2025}, eprint={2505.14464}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.14464}, } ```

📘 数据集概述 AM-Thinking-v1与Qwen3-235B-A22B是两款从顶尖教师模型(teacher model)中蒸馏得到的推理数据集。两款数据集均基于涵盖广泛推理领域的共享189万条查询生成,且所有回复均经过自动校验,品质精良。 两款数据集采用一致的格式与校验流程(verification pipeline),可直接进行对比并无缝集成至下游任务中。本数据集旨在助力具备强推理能力的开源大语言模型(Large Language Model, LLM)的研发。基准测试结果显示,其在AIME2024、AIME2025、MATH500及LiveCodeBench等评测任务中均展现出优异性能。 如需了解AM-Qwen3-Distilled数据集的更多详情,请访问:https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled。 📊 基准测试性能 ![alt text](benchmarks.png) | 基准测试任务 | [AM-Thinking-v1 蒸馏版](https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) | [Qwen3-235B-A22B 蒸馏版](https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled) | [DeepSeek-R1 蒸馏版] | [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | [AM-Thinking-v1](https://huggingface.co/a-m-team/AM-Thinking-v1) | [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | |-------------------|--------------------------|---------------------------|---------------------------|--------------------------|--------------------------|---------------------------|---------------------------| | AIME2024 | **84.3** | 79.4 | 70.9 | 81.4| 85.3 | 85.7 | 79.8 | | AIME2025 | **72.2** | 62.2 |52.8 | 72.9| 74.4 | 81.5 | 70.0| | MATH500 | **98.4** | 93.9 |95.8 | - |-| - |-| | LiveCodeBench | **65.9** | 59.6 | 57.0| 65.7| 70.3 | 70.7 | 64.3| 上述结果均为使用单数据集单独训练的模型所得,直观体现了教师模型质量对下游推理能力的影响。 📂 数据集结构 ### 数据字段 每条样本均为字典格式,包含以下字段: - `"system"`:蒸馏过程中使用的系统提示词(system prompt),通常通过`<think>`与`<answer>`标签引导结构化推理。注意:本数据集部分样本的`"system"`字段为空,且训练过程中并未使用该字段,可自由按需使用。 - `"conversations"`:对话轮次列表,结构如下: - `"from"`:取值为`'human'`(人类)或`'assistant'`(助手)。 - `"value"`:完整的消息内容。 - `"info"`:包含以下内容的元数据字典: - `"source"`:数据集来源(例如`OpenHermes-2.5`)。 - `"category"`:任务领域(例如`math`(数学)、`code`(代码)、`other`(其他))。 - `"ground_truth"`:标准答案参考(如适用)。 - `"test_case"`:关联的测试用例ID(可选)。 - `"instruction_constrain"`:指令约束元数据(可选)。 - `"think_content"`:助手的推理过程轨迹。 - `"answer_content"`:最终答案片段。 - `"verify_score"`:校验置信度分数(浮点数,≥0.9)。 - `"model_name"`:教师模型名称(`am_thinking_v1`或`qwen3_235b_a22b`)。 - `"ppl"`:助手输出的困惑度(Perplexity)。 📈 数据集统计信息 - 共享查询集:**189万**条唯一提示词 - 两款数据集分别基于以下模型蒸馏得到回复: - **AM-Thinking-v1** - **Qwen3-235B-A22B** - 任务类别分布: - 通用对话:约41.8% - 数学推理:约29.5% - 代码生成:约17.1% - 其他类别(科学、指令遵循、对话):约11.6% 通用对话类别涵盖多轮对话及其他类型的数据。 ![alt text](AM-distilled.png) ✅ 校验与质量管控 所有输出均经过**自动化校验**,并针对不同任务类别采用定制化校验方法: - **数学任务**:采用Math-Verify工具进行二元判定(通过/不通过) - **代码任务**:在沙箱环境中基于测试用例进行验证 - **科学任务**:通过大语言模型(Large Language Model, LLM)评分计算答案相似度 - **指令遵循任务**:由`IFEval`校验器完成验证 - **通用对话任务**:使用奖励模型(例如Decision-Tree-Reward-Llama-3.1-8B)进行评估 两款数据集均单独执行以下质控流程: - 使用高性能32B大语言模型进行困惑度过滤 - N-gram重复内容过滤 - 结构格式校验(例如检查`<think>`与`<answer>`标签是否存在) ⚠️ 局限性 开发者需严格将本项目开源代码、数据集、模型及相关产物的使用范围限定于**科研用途**。**商业使用及任何可能造成危害的应用均被严格禁止**。 本数据集的内容不代表任何个人或机构的观点、立场或背书。作者不对因使用、不当使用或解读本数据集及相关材料所引发的后果承担任何责任。 📜 引用 若使用本数据集,请引用以下文献: @misc{tian2025correctanswersequaldistillation, title={Not All Correct Answers Are Equal: Why Your Distillation Source Matters}, author={Xiaoyu Tian and Yunjie Ji and Haotian Wang and Shuaiting Chen and Sitong Zhao and Yiping Peng and Han Zhao and Xiangang Li}, year={2025}, eprint={2505.14464}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2505.14464}, }
提供机构:
maas
创建时间:
2025-05-21
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作