DeepSeek-v3.1-reasoner-Distilled-math-samples
收藏魔搭社区2026-01-06 更新2025-08-30 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/DeepSeek-v3.1-reasoner-Distilled-math-samples
下载链接
链接失效反馈官方服务:
资源简介:
# DeepSeek-V3.1 Distillation with NVIDIA Nemotron-Post-Training-Dataset-v2 (Math Subset)
The release of **DeepSeek-V3.1** has attracted wide attention in the AI community. Its significant improvements in reasoning ability provide a new opportunity to explore optimization of domain-specific models. To investigate the potential of this model in complex mathematical reasoning tasks, I selected the **math subset** from NVIDIA’s newly released **Nemotron-Post-Training-Dataset-v2** as seed problems and performed knowledge distillation on the **DeepSeek-V3.1 Reasoner** mode. The results are also compared with the outputs from **DeepSeek-R1-0528**.
---
# 📊 Results & Visualizations
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/D0BZ2OsTWPUUJi_fBf4Fn.png" alt="Text Length Analysis: histograms for question, question word count, reasoning, answer, seed answer(by R1-0528), and averaged character lengths" width="65%">
<br><em>Figure 1. Text length distributions for question, reasoning, answer, and seed answer.</em>
</p>
### 🔑 Key Takeaways
- **Average Length (characters)**: Question ≈ **227**, Reasoning ≈ **17,236**, Answer ≈ **2,143**, Seed Answer ≈ **1,669**.
- **Reasoning** shows a clear **long-tail distribution**, with a few samples containing extremely long chains of thought — consistent with the behavior of the “Reasoner mode” on complex math tasks.
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/bZbfdrwnneI0N6NXlmsFm.png" alt="Token Usage Analysis: histograms for prompt/completion/total tokens and a scatter of prompt vs completion" width="65%">
<br><em>Figure 2. Prompt/completion/total token usage and their relationship.</em>
</p>
### 📌 Token Usage Highlights
- **Average Prompt Tokens ≈ 74.8**, **Average Completion Tokens ≈ 7,183**, **Average Total Tokens ≈ 7,258**.
- There is a **weak positive correlation** between prompt and completion length: longer problem statements/contexts usually lead to longer reasoning outputs, though the decisive factor remains task complexity.
- **Costs and latency** are driven almost entirely by **Completion length**; for reasoning tasks, optimizing output length and stability should be prioritized.
---
## Why NVIDIA’s Dataset?
The choice of NVIDIA’s dataset was based on several considerations:
- As the latest public dataset, the problems in version 2 are less likely to have appeared in widely available training corpora.
- According to NVIDIA’s technical report, the math subset was deliberately designed to emphasize evaluation of **mathematical reasoning** and **multi-step Chain-of-Thought** abilities.
- Each problem is preceded by a standardized prompt:
> *“Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}...”*
While such formatting instructions are useful for **SFT** and **RL evaluation** to standardize outputs, during distillation they may lead to templated reasoning chains, reducing the diversity of generated data.
➡️ To mitigate this, I retained **50% of the template prompts** and removed the formatting instructions from the other half of the questions—helping to prevent overfitting and improve diversity.
---
## Data Processing
- Extracted reasoning chains and answers from **DeepSeek-V3.1**.
- Added comparison outputs from **DeepSeek-R1-0528** (answers only, without reasoning).
- Collected additional metadata such as token lengths of reasoning and answers from V3.1.
---
## Experiment Setup
- **API**: Official DeepSeek API
- **Hyperparameters**: Default (temperature, top_p, etc. unchanged)
- **Context Window**: 32K
---
## Limitations
- The distillation scale is relatively small.
- Since **DeepSeek-V3.1** is newly released, most third-party inference providers have not yet integrated the model.
- Reliance on the official API, which is **unstable** and **extremely slow** (even with concurrency and API key rotation).
- Only a small number of demonstration samples were distilled, followed by filtering and cleaning.
- My hosting servers are currently occupied with GPT-related projects, so **larger-scale, high-quality dataset construction** will be done once resources are available.
---
```bibtex
@misc{rong2025deepseekv31distill,
title = {DeepSeek-V3.1 Distillation with NVIDIA Nemotron-Post-Training-Dataset-v2 (Math Subset)},
author = {Rong, Jack},
year = {2025},
note = {Jackrong/DeepSeek-v3.1-reasoner-Distilled-math-samples},
url = {https://huggingface.co/datasets/Jackrong/DeepSeek-v3.1-reasoner-Distilled-math-samples}
}
```
# DeepSeek-V3.1 结合 NVIDIA Nemotron-Post-Training-Dataset-v2(数学子集)的知识蒸馏(Knowledge Distillation)
**DeepSeek-V3.1** 发布后在人工智能(AI)社区引发广泛关注,其在推理能力上的显著提升为探索领域专用模型的优化提供了新契机。为探究该模型在复杂数学推理任务中的潜力,我选取NVIDIA最新发布的**Nemotron-Post-Training-Dataset-v2**中的**数学子集**作为种子问题,并针对**DeepSeek-V3.1 Reasoner**模式开展知识蒸馏。同时将实验结果与**DeepSeek-R1-0528**的输出进行了对比。
---
# 📊 实验结果与可视化
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/D0BZ2OsTWPUUJi_fBf4Fn.png" alt="文本长度分析:针对问题、问题词数、推理过程、答案、种子答案(由R1-0528生成)的平均字符长度直方图" width="65%">
<br><em>图1. 问题、推理过程、答案与种子答案的文本长度分布。</em>
</p>
### 🔑 核心结论
- **平均字符长度**:问题约227,推理过程约17236,答案约2143,种子答案约1669。
- **推理过程**呈现明显的**长尾分布**,少量样本包含极长的思维链(Chain-of-Thought)——这与“Reasoner模式”在复杂数学任务中的表现一致。
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/bZbfdrwnneI0N6NXlmsFm.png" alt="Token使用分析:提示词/补全/总Token的直方图,以及提示词与补全的散点图" width="65%">
<br><em>图2. 提示词、补全、总Token的使用情况及其关联。</em>
</p>
### 📌 Token使用要点
- **平均提示词Token数约74.8**,**平均补全Token数约7183**,**平均总Token数约7258**。
- 提示词与补全长度存在**弱正相关**:更长的问题描述/上下文通常对应更长的推理输出,但任务复杂度仍是决定性因素。
- **成本与延迟几乎完全由补全长度决定**;针对推理任务,应优先优化输出长度与稳定性。
---
## 为何选择NVIDIA的数据集?
本次选用NVIDIA的数据集基于多方面考量:
- 作为最新的公开数据集,v2版本中的问题在广泛可用的训练语料库中出现的概率更低。
- 根据NVIDIA的技术报告,该数学子集的设计初衷是侧重评估**数学推理能力**与**多步思维链(Chain-of-Thought)**能力。
- 每个问题前均带有标准化提示词:
> *“请解决以下数学问题。确保将答案(仅答案)置于oxed{}中……”*
这类格式指令虽有助于**监督微调(Supervised Fine-Tuning, SFT)**与强化学习(RL)评估以统一输出格式,但在蒸馏过程中可能会催生模板化的推理链,降低生成数据的多样性。
➡️ 为缓解这一问题,我保留了**50%的模板提示词**,并移除了其余半数问题的格式指令——这有助于防止过拟合并提升数据多样性。
---
## 数据处理流程
- 从**DeepSeek-V3.1**中提取推理链与答案。
- 加入**DeepSeek-R1-0528**的对比输出(仅答案,不含推理过程)。
- 收集额外元数据,例如V3.1中推理过程与答案的Token长度。
---
## 实验设置
- **API**:DeepSeek官方API
- **超参数**:默认设置(温度、top_p等参数未作修改)
- **上下文窗口**:32K
---
## 局限性
- 本次蒸馏的规模相对较小。
- 由于**DeepSeek-V3.1**为新近发布的模型,多数第三方推理服务商尚未集成该模型。
- 依赖官方API,该API**不稳定且速度极慢**(即便采用并发请求与API密钥轮换策略)。
- 仅对少量演示样本进行了蒸馏,随后进行了筛选与清洗。
- 我的托管服务器目前正运行与GPT相关的项目,因此**大规模高质量数据集构建**将待资源释放后开展。
---
bibtex
@misc{rong2025deepseekv31distill,
title = {DeepSeek-V3.1 Distillation with NVIDIA Nemotron-Post-Training-Dataset-v2 (Math Subset)},
author = {Rong, Jack},
year = {2025},
note = {Jackrong/DeepSeek-v3.1-reasoner-Distilled-math-samples},
url = {https://huggingface.co/datasets/Jackrong/DeepSeek-v3.1-reasoner-Distilled-math-samples}
}
提供机构:
maas
创建时间:
2025-08-22



