GPT-OSS-20B-Distilled-Reasoning-Mini
Source: ModelScope community · Updated 2026-01-06 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/AI-ModelScope/GPT-OSS-20B-Distilled-Reasoning-Mini
Resource description:

# Dataset Card for GPT-OSS-20B-Distilled-Reasoning-Mini
GPT-OSS-20B Distilled Reasoning Dataset Mini
* (Multi-stage Evaluative Refinement Method for Reasoning Generation)
## Dataset Details and Description
This is a high-quality instruction fine-tuning dataset constructed through knowledge distillation, featuring detailed Chain-of-Thought (CoT) reasoning processes. The dataset is designed to enhance the capabilities of smaller language models in complex reasoning, logical analysis, and instruction following.
I generated this dataset through a meticulously designed three-stage process:
- **Seed Question Aggregation:** I extensively gathered and filtered seed questions from multiple high‑quality data sources, including self‑generated ones.
- **Knowledge Distillation & Cleaning:** I used `openai/gpt-oss-20b` (High) as the teacher model to generate in‑depth reasoning and responses for the seed questions, which then underwent a rigorous automated cleaning process.
- **LLM‑as‑a‑Judge Quality Assessment:** I employed `Qwen/Qwen3-235B-A22B-Instruct-2507` as the judge model to score the generated data across multiple dimensions, ensuring the quality of the final output.
- **Curated by:** [Jack Rong]
- **Funded by:** [Jack Rong]
- **Shared by:** [Jack Rong]
- **Language(s) (NLP):** [En]
- **License:** [apache-2.0]
- **Teacher Model**: `unsloth/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`
#### Source Data
<div style="display: flex; gap: 10px; justify-content: space-around;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/hEWXvzYitkXka19sBs9k_.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/61KVV5AJeDrm5FpMrxZzC.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/YQ49_Kdu8kPY9VfWnAD0_.png" style="width: calc(33.33% - 10px);">
</div>
### Phase 1: Seed Question Aggregation
To ensure diversity and comprehensiveness in the distilled instruction sources, I employed a **multi-source aggregation strategy** to construct the seed question pool:
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/bm5_WgDdPTFnWFgGSBGTs.png" width="300">
- **Self-generated** (300 entries):
Using the **unsloth/Qwen3-4B-Instruct-2507** model, I generated **300 high-quality, representative initial questions** based on 7 predefined core thinking templates:
- Logical reasoning
- Basic mathematics
- Scientific knowledge
- Programming thinking
- Common sense
- Creative imagination
- Ethical dilemmas
- **External Dataset Integration**:
To further enhance coverage, we sampled from multiple **established public datasets**, including:
- `facebook/natural_reasoning`
- `vicgalle/alpaca-gpt4`
- `openai/gsm8k`
- `fka/awesome-chatgpt-prompts`
- `microsoft/rStar-Coder`
- `nvidia/OpenScience`
Prior to distillation, all seed questions underwent rigorous automated cleaning to filter entries with:
- Missing referenced attachments
- Non-English content
- Excessively short queries
This process resulted in **1,990 high-quality questions** for final distillation (from an initial pool of 1,994 entries).
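The three filters above could be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the length threshold, the ASCII-ratio heuristic for "non-English", and the attachment pattern are all assumptions, since the card does not state how each check was implemented.

```python
import re

# Hypothetical values: the card does not specify thresholds or patterns.
MIN_CHARS = 15
ATTACHMENT_HINTS = re.compile(
    r"\b(?:attached|enclosed|following|above)\s+"
    r"(?:figure|image|table|file|diagram|graph)\b",
    re.IGNORECASE,
)


def is_mostly_ascii(text: str, threshold: float = 0.9) -> bool:
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= threshold


def keep_seed(question: str) -> bool:
    """Return True if a seed question passes all three illustrative filters."""
    q = question.strip()
    if len(q) < MIN_CHARS:            # excessively short query
        return False
    if not is_mostly_ascii(q):        # likely non-English content
        return False
    if ATTACHMENT_HINTS.search(q):    # refers to an attachment that is not included
        return False
    return True


def clean_seeds(questions):
    """Keep only the questions that pass keep_seed, preserving order."""
    return [q for q in questions if keep_seed(q)]
```

A real pipeline would more likely use a proper language-identification model than an ASCII ratio; the heuristic here only keeps the sketch dependency-free.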
### Phase 2: Distillation & Cleaning
<div style="display: flex; gap: 10px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/P4LYLG5K6FNJ-xrHkFO0X.png" alt="Image 1" width="300">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/0kgR7OMbtEF7SQJ1Z-T7w.png" alt="Image 2" width="300">
</div>

This phase forms the core of generating high-quality chain-of-thought (CoT) content. The entire workflow ensures that **both inputs and outputs undergo rigorous quality verification** – not merely cleaning the inputs, but systematically validating the outputs.
- **Distillation Generation**:
I employed `openai/gpt-oss-20b` (High) as the teacher model. For each seed question, a specially designed prompt template was used to explicitly instruct the model to:
(1) Conduct detailed **"Think step-by-step" chain-of-thought reasoning**, followed by
(2) Delivering the final answer.
This structured approach guarantees comprehensive reasoning before conclusion.
- **Post-Distillation Cleaning**:
Raw CoT outputs and answers from the teacher model were **not directly adopted**. Instead, we implemented an additional quality assessment and cleaning round to:
- Eliminate low-quality outputs caused by **model hallucinations**, **logical inconsistencies**, or **formatting errors**
- Filter entries failing factual coherence or structural integrity checks
This critical step ensures **every sample** in the final dataset contains **high-quality, reliable reasoning processes and answers**.
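As one illustration of a structural integrity check, the sketch below tests only the formatting side of the cleaning step. It assumes that each answer must end with a `Final answer:` line, which is inferred from the data instance in this card; the function name and the minimum CoT length are hypothetical, and checks for hallucinations or logical inconsistencies are not covered.

```python
FINAL_MARKER = "Final answer:"


def passes_structure_check(cot: str, answer: str, min_cot_chars: int = 20) -> bool:
    """Illustrative format check for one distilled sample.

    min_cot_chars is an assumed threshold, not a value from the card.
    """
    # The reasoning trace must be present and non-trivial.
    if len(cot.strip()) < min_cot_chars:
        return False

    # The answer must have at least one non-empty line.
    lines = [ln.strip() for ln in answer.strip().splitlines() if ln.strip()]
    if not lines:
        return False

    # The last line must carry the marker followed by actual content.
    last = lines[-1]
    return last.startswith(FINAL_MARKER) and len(last) > len(FINAL_MARKER)
```

Samples that fail such a check could be dropped or regenerated; the card does not say which of the two was done.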
### Phase 3: LLM-as-a-Judge Quality Scoring
To quantitatively evaluate distilled data quality, we implemented an **LLM-as-a-Judge** assessment framework that measures multiple reasoning dimensions.
- **Judge Model**:
We used the more powerful `Qwen/Qwen3-235B-A22B-Instruct-2507` model (235B parameters) as the judge.
- **Multi-Dimensional Evaluation Rubric**:
A comprehensive **three-aspect scoring system** was developed, requiring the judge model to rate each entry on a 1-10 scale:
- **Input Score**: Measures question clarity, answerability, reasoning density, and contextual relevance
- **CoT Score**: Evaluates logical coherence, presence of **logical leaps**, constraint adherence, and alignment with final answer
- **Answer Score**: Assesses accuracy, completeness, **format adherence**, and robustness
- **Weighted Scoring Formula**:
To emphasize reasoning and answer quality, we applied a weighted average:
`Total Score = 0.15 × Input Score + 0.35 × CoT Score + 0.5 × Answer Score`
- **Scoring Results**:
After evaluating **1,946 entries** (excluding 44 failed samples), the dataset shows **high quality**, with most scores concentrated in the 8.0-10.0 range. This distribution supports the effectiveness of the three-stage distillation process.
<div style="display: flex; gap: 12px; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/-5qm3NXBKM2q_Twg2DfHy.png"
alt="Figure 1"
width="450"
style="object-fit: contain;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/W7G5ua95zMGVTyaGxbq9N.png"
alt="Figure 2"
width="280"
style="object-fit: contain;">
</div>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/_-c1UaAqYGn_2d81CPVtx.png"
alt="800px image"
width="650"
style="display:block; margin:auto;">
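The weighted scoring formula from this phase can be written as a short function. The weights and the 1-10 range come from the card; the function name and the range validation are assumptions added for illustration.

```python
# Weights taken from the card: 0.15 input, 0.35 CoT, 0.5 answer.
WEIGHTS = {"input": 0.15, "cot": 0.35, "answer": 0.5}


def total_score(scores: dict) -> float:
    """Weighted average of the three judge scores (each expected in 1-10)."""
    for key in WEIGHTS:
        value = scores[key]
        if not 1 <= value <= 10:
            raise ValueError(f"{key} score {value} is outside the 1-10 scale")
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)
```

Because the weights sum to 1.0, the total stays on the same 1-10 scale as the individual scores. For example, scores of 8 (input), 9 (CoT), and 7 (answer) give 0.15 × 8 + 0.35 × 9 + 0.5 × 7 = 7.85.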
### Direct Use
This dataset is primarily designed for instruction fine-tuning of language models, specifically to enhance their chain-of-thought (CoT) generation and complex reasoning capabilities. It applies to the following scenarios:
- **Model Fine-Tuning**: Serves as high-quality training data for fine-tuning language models of various scales to improve their logical reasoning, step-by-step problem decomposition, and instruction-following abilities in English contexts.
- **Academic Research**: Enables investigation into knowledge distillation efficacy, CoT generation mechanisms, and the validity of LLM-as-a-Judge evaluation methodologies.
- **Model Evaluation**: Functions as a benchmark to assess model performance on complex reasoning tasks.
### Out-of-Scope Use
- **Critical Fact-Checking in Production**: Generated by AI models, this dataset may contain factual inaccuracies. It **shall not** be used in production environments requiring high factual precision (e.g., medical, legal, or financial advice) without rigorous fact verification.
- **Non-English Tasks**: Content is exclusively in English and **is not suitable** for direct fine-tuning of Chinese or other language models.
- **Malicious Applications**: **Strictly prohibited** for any harmful purposes, including but not limited to generating toxic, unethical, biased, or misleading content.
### Dataset Structure
The dataset is provided in `.jsonl` format, where each line constitutes a JSON object. No predefined train/validation/test splits are included; users are advised to perform random splits according to their specific requirements.
### Data Fields
- **id (string)**: Unique identifier for the sample.
- **category (string)**: Category to which the question belongs.
- **input (string)**: User's instruction or query.
- **CoT_content (string)**: Chain-of-thought reasoning process generated by the teacher model.
- **answer_content (string)**: Final answer generated by the teacher model.
- **judge_reviews (list)**: List containing review comments from judge models.
- **scores (dict)**: Dictionary containing specific scores for `input`, `cot`, and `answer` dimensions.
- **total_score (float)**: Final quality score calculated using a weighted formula.
- **num_judgments (int)**: Number of judgments (defaulting to 1).
### Data Instance
```json
{"id": "hf_openai_gsm8k_00173", "category": "gsm8k",
"input": "Mitchell has 30 pencils. He has 6 more pencils than Antonio. How many pencils does Mitchell and Antonio have together?",
"CoT_content": "<|channel|>analysis<|message|>We need to parse the problem: Mitchell has 30 pencils. He has 6 more pencils than Antonio. So Antonio's pencils = Mitchell - 6 = 30 - 6 = 24. Then together they have 30 + 24 = 54 pencils.\n\nCheck: If Mitchell has 30 and Antonio has 24, difference is indeed 6. So answer 54.\n\nThus final answer: 54 pencils.\n\nWe need to produce answer with \"Final answer:\" on a new line.",
"answer_content": "To find the total number of pencils:\n\n1. **Determine Antonio's pencils** \n Mitchell has 6 more than Antonio. \n \\[\n \\text{Antonio} = \\text{Mitchell} - 6 = 30 - 6 = 24\n \\]\n\n2. **Add their pencils together** \n \\[\n \\text{Total} = \\text{Mitchell} + \\text{Antonio} = 30 + 24 = 54\n \\]\n\nSo, Mitchell and Antonio have a total of **54 pencils**.\n\nFinal answer: 54",
"judge_reviews": [{"id": "hf_openai_gsm8k_00173", "input_review": "The input is clear, directly stated, and requires basic arithmetic reasoning with a well-defined solution path.", "cot_review": "The chain-of-thought is logically coherent, correctly interprets the relationship between Mitchell and Antonio's pencils, and verifies the result.", "answer_review": "The final answer is correct, clearly formatted, includes proper justification, and ends with 'Final answer: 54' as required.",
"scores": {"input": 10, "cot": 10, "answer": 10}}], "scores": {"input": 10, "cot": 10, "answer": 10},
"total_score": 10.0, "num_judgments": 1}
```
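A minimal sketch of reading the `.jsonl` file with the Python standard library and selecting samples by `total_score` is shown below. The field names come from the Data Fields section; the helper names, the `min_total` threshold, and the output keys of `to_training_example` are hypothetical.

```python
import json


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def filter_by_score(records, min_total=8.0):
    """Keep records whose weighted judge score meets an assumed threshold."""
    return [r for r in records if r["total_score"] >= min_total]


def to_training_example(record):
    """Map dataset fields onto an illustrative prompt/reasoning/response layout."""
    return {
        "prompt": record["input"],
        "reasoning": record["CoT_content"],
        "response": record["answer_content"],
    }
```

With a local copy of the file, `filter_by_score(load_jsonl(path))` followed by `to_training_example` on each record yields fine-tuning examples; the target layout should be adapted to whatever format the training framework expects.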
### Data Splits
This dataset **does not include predefined train/validation/test splits**. Users are **advised to perform random splits** according to their specific requirements to:
- Prevent data leakage during evaluation
- Maintain category distribution balance (critical for reasoning tasks)
- Align with downstream task constraints
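One way to make a random split that also preserves the category distribution is a stratified split on the `category` field. The sketch below uses only the standard library; the function name, the 10% evaluation fraction, and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, eval_fraction=0.1, seed=42):
    """Split records into (train, evaluation), sampling per category."""
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for category in sorted(by_category):
        group = list(by_category[category])
        rng.shuffle(group)
        n_eval = int(round(len(group) * eval_fraction))
        evaluation.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluation
```

Each record lands in exactly one of the two lists, which avoids leakage between training and evaluation. Very small categories may contribute no evaluation samples at low fractions.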
# Acknowledgments
The creation of this dataset was made possible by the following outstanding open-source models and datasets. We extend our sincere gratitude:
- **Teacher Model**: `openai/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`
- **Seed Question Sources**:
`facebook/natural_reasoning`,
`vicgalle/alpaca-gpt4`,
`openai/gsm8k`,
`fka/awesome-chatgpt-prompts`,
`microsoft/rStar-Coder`,
`nvidia/OpenScience`
```bibtex
@dataset{jackrong2025gptossdistilled,
author = {JACKRONG},
title = {Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini},
month = aug,
year = {2025},
publisher = {Hugging Face},
version = {1.0.0},
url = {https://huggingface.co/datasets/Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini}
}
```
Provider:
maas
Created:
2025-08-19



