GPT-OSS-20B-Distilled-Reasoning-Mini

Name: GPT-OSS-20B-Distilled-Reasoning-Mini
Creator: maas
Published: 2026-01-06 16:43:26
License: No description provided

ModelScope community. Updated 2026-01-06; indexed 2025-08-23.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/GPT-OSS-20B-Distilled-Reasoning-Mini


Resource description:

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/yuc6SULgg2DNRDJQrGB5l.jpeg)

# Dataset Card for GPT-OSS-20B Distilled Reasoning Dataset Mini

(Multi-stage Evaluative Refinement Method for Reasoning Generation)

## Dataset Details and Description

This is a high-quality instruction fine-tuning dataset constructed through knowledge distillation, featuring detailed Chain-of-Thought (CoT) reasoning processes. The dataset is designed to enhance the capabilities of smaller language models in complex reasoning, logical analysis, and instruction following.

I generated this dataset through a three-stage process:

- **Seed Question Aggregation:** I gathered and filtered seed questions from multiple high-quality data sources, including self-generated ones.
- **Knowledge Distillation & Cleaning:** I used `openai/gpt-oss-20b` (High) as the teacher model to generate in-depth reasoning and responses for the seed questions, which then underwent a rigorous automated cleaning process.
- **LLM-as-a-Judge Quality Assessment:** I employed `Qwen/Qwen3-235B-A22B-Instruct-2507` as the judge model to score the generated data across multiple dimensions, ensuring the quality of the final output.
- **Curated by:** [Jack Rong]
- **Funded by:** [Jack Rong]
- **Shared by:** [Jack Rong]
- **Language(s) (NLP):** [En]
- **License:** [apache-2.0]
- **Teacher Model**: `unsloth/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`

#### Source Data

<div style="display: flex; gap: 10px; justify-content: space-around;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/hEWXvzYitkXka19sBs9k_.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/61KVV5AJeDrm5FpMrxZzC.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/YQ49_Kdu8kPY9VfWnAD0_.png" style="width: calc(33.33% - 10px);">
</div>

### Phase 1: Seed Question Aggregation

To ensure diversity and comprehensiveness in the distilled instruction sources, I employed a **multi-source aggregation strategy** to construct the seed question pool:

<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/bm5_WgDdPTFnWFgGSBGTs.png" width="300">

- **Self-generated** (300 entries): Using the **unsloth/Qwen3-4B-Instruct-2507** model, I generated **300 high-quality, representative initial questions** based on 7 predefined core thinking templates:
  - Logical reasoning
  - Basic mathematics
  - Scientific knowledge
  - Programming thinking
  - Common sense
  - Creative imagination
  - Ethical dilemmas
- **External Dataset Integration**: To further enhance coverage, we sampled from multiple **established public datasets**, including:
  - `facebook/natural_reasoning`
  - `vicgalle/alpaca-gpt4`
  - `openai/gsm8k`
  - `fka/awesome-chatgpt-prompts`
  - `microsoft/rStar-Coder`
  - `nvidia/OpenScience`

Prior to distillation, all seed questions underwent rigorous automated cleaning to filter entries with:

- Missing referenced attachments
- Non-English content
- Excessively short queries

This process resulted in **1,990
high-quality questions** for final distillation (from an initial pool of 1,994 entries).

### Phase 2: Distillation & Cleaning

<div style="display: flex; gap: 10px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/P4LYLG5K6FNJ-xrHkFO0X.png" alt="Image 1" width="300">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/0kgR7OMbtEF7SQJ1Z-T7w.png" alt="Image 2" width="300">
</div>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/knsldQhMFoy4eyF-X3qi-.png)

This phase forms the core of generating high-quality chain-of-thought (CoT) content. The workflow applies quality verification to **both inputs and outputs**: the inputs are cleaned, and the outputs are systematically validated as well.

- **Distillation Generation**: I employed `openai/gpt-oss-20b` (High) as the teacher model. For each seed question, a specially designed prompt template explicitly instructed the model to (1) conduct detailed **"Think step-by-step" chain-of-thought reasoning**, and then (2) deliver the final answer. This structure makes the model reason fully before it concludes.
- **Post-Distillation Cleaning**: Raw CoT outputs and answers from the teacher model were **not directly adopted**. Instead, we implemented an additional quality assessment and cleaning round to:
  - Eliminate low-quality outputs caused by **model hallucinations**, **logical inconsistencies**, or **formatting errors**
  - Filter entries failing factual coherence or structural integrity checks

This step is intended to ensure that **every sample** in the final dataset contains **high-quality, reliable reasoning processes and answers**.

### Phase 3: LLM-as-a-Judge Quality Scoring

To quantitatively evaluate distilled data quality, we implemented an **LLM-as-a-Judge** assessment framework that measures multiple reasoning dimensions.
- **Judge Model**: We used the more powerful `Qwen/Qwen3-235B-A22B-Instruct-2507` model (235B total parameters) as the judge.
- **Multi-Dimensional Evaluation Rubric**: A **three-aspect scoring system** was developed, requiring the judge model to rate each entry on a 1-10 scale:
  - **Input Score**: Measures question clarity, answerability, reasoning density, and contextual relevance
  - **CoT Score**: Evaluates logical coherence, presence of **logical leaps**, constraint adherence, and alignment with the final answer
  - **Answer Score**: Assesses accuracy, completeness, **format adherence**, and robustness
- **Weighted Scoring Formula**: To emphasize reasoning and answer quality, we applied a weighted average:

  `Total Score = 0.15 × Input Score + 0.35 × CoT Score + 0.5 × Answer Score`
- **Scoring Results**: After evaluating **1,946 entries** (the 1,990 distilled questions minus 44 failed samples), most scores are concentrated in the 8.0-10.0 range. This distribution supports the effectiveness of the three-stage distillation process.

<div style="display: flex; gap: 12px; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/-5qm3NXBKM2q_Twg2DfHy.png" alt="Figure 1" width="450" style="object-fit: contain;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/W7G5ua95zMGVTyaGxbq9N.png" alt="Figure 2" width="280" style="object-fit: contain;">
</div>

<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/_-c1UaAqYGn_2d81CPVtx.png" alt="800px image" width="650" style="display:block; margin:auto;">

### Direct Use

This dataset is primarily designed for instruction fine-tuning of language models, specifically to enhance their chain-of-thought (CoT) generation and complex reasoning capabilities.
It applies to the following scenarios:

- **Model Fine-Tuning**: Serves as high-quality training data for fine-tuning language models of various scales to improve their logical reasoning, step-by-step problem decomposition, and instruction-following abilities in English contexts.
- **Academic Research**: Enables investigation into knowledge distillation efficacy, CoT generation mechanisms, and the validity of LLM-as-a-Judge evaluation methodologies.
- **Model Evaluation**: Functions as a benchmark to assess model performance on complex reasoning tasks.

### Out-of-Scope Use

- **Critical Fact-Checking in Production**: Generated by AI models, this dataset may contain factual inaccuracies. It **shall not** be used in production environments requiring high factual precision (e.g., medical, legal, or financial advice) without rigorous fact verification.
- **Non-English Tasks**: Content is exclusively in English and **is not suitable** for direct fine-tuning of Chinese or other language models.
- **Malicious Applications**: **Strictly prohibited** for any harmful purposes, including but not limited to generating toxic, unethical, biased, or misleading content.

### Dataset Structure

The dataset is provided in `.jsonl` format, where each line constitutes a JSON object. No predefined train/validation/test splits are included; users are advised to perform random splits according to their specific requirements.

### Data Fields

- **id (string)**: Unique identifier for the sample.
- **category (string)**: Category to which the question belongs.
- **input (string)**: User's instruction or query.
- **CoT_content (string)**: Chain-of-thought reasoning process generated by the teacher model.
- **answer_content (string)**: Final answer generated by the teacher model.
- **judge_reviews (list)**: List containing review comments from judge models.
- **scores (dict)**: Dictionary containing specific scores for `input`, `cot`, and `answer` dimensions.
- **total_score (float)**: Final quality score calculated using a weighted formula.
- **num_judgments (int)**: Number of judgments (defaulting to 1).

### Data Instance

```json
{"id": "hf_openai_gsm8k_00173", "category": "gsm8k", "input": "Mitchell has 30 pencils. He has 6 more pencils than Antonio. How many pencils does Mitchell and Antonio have together?", "CoT_content": "<|channel|>analysis<|message|>We need to parse the problem: Mitchell has 30 pencils. He has 6 more pencils than Antonio. So Antonio's pencils = Mitchell - 6 = 30 - 6 = 24. Then together they have 30 + 24 = 54 pencils.\n\nCheck: If Mitchell has 30 and Antonio has 24, difference is indeed 6. So answer 54.\n\nThus final answer: 54 pencils.\n\nWe need to produce answer with \"Final answer:\" on a new line.", "answer_content": "To find the total number of pencils:\n\n1. **Determine Antonio's pencils** \n Mitchell has 6 more than Antonio. \n \\[\n \\text{Antonio} = \\text{Mitchell} - 6 = 30 - 6 = 24\n \\]\n\n2. **Add their pencils together** \n \\[\n \\text{Total} = \\text{Mitchell} + \\text{Antonio} = 30 + 24 = 54\n \\]\n\nSo, Mitchell and Antonio have a total of **54 pencils**.\n\nFinal answer: 54", "judge_reviews": [{"id": "hf_openai_gsm8k_00173", "input_review": "The input is clear, directly stated, and requires basic arithmetic reasoning with a well-defined solution path.", "cot_review": "The chain-of-thought is logically coherent, correctly interprets the relationship between Mitchell and Antonio's pencils, and verifies the result.", "answer_review": "The final answer is correct, clearly formatted, includes proper justification, and ends with 'Final answer: 54' as required.", "scores": {"input": 10, "cot": 10, "answer": 10}}], "scores": {"input": 10, "cot": 10, "answer": 10}, "total_score": 10.0, "num_judgments": 1}
```

### Data Splits

This dataset **does not include predefined train/validation/test splits**.
Users are **advised to perform random splits** according to their specific requirements to:

- Prevent data leakage during evaluation
- Maintain category distribution balance (critical for reasoning tasks)
- Align with downstream task constraints

# Acknowledgments

The creation of this dataset was made possible by the following outstanding open-source models and datasets. We extend our sincere gratitude:

- **Teacher Model**: `openai/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`
- **Seed Question Sources**: `facebook/natural_reasoning`, `vicgalle/alpaca-gpt4`, `openai/gsm8k`, `fka/awesome-chatgpt-prompts`, `microsoft/rStar-Coder`, `nvidia/OpenScience`

```bibtex
@dataset{jackrong2025gptossdistilled,
  author    = {JACKRONG},
  title     = {Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini},
  month     = aug,
  year      = {2025},
  publisher = {Hugging Face},
  version   = {1.0.0},
  url       = {https://huggingface.co/datasets/Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini}
}
```
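
The weighted scoring formula from Phase 3 and the split advice under Data Splits can be sketched in a few lines of Python. This is a minimal illustration, not the author's pipeline: the function names `weighted_total`, `load_jsonl`, and `stratified_split`, the rounding to two decimals, and the per-category split strategy are all assumptions made for the example. Only the weights (0.15, 0.35, 0.5) and the field names (`category`, `scores`) come from the card.

```python
import json
import random
from collections import defaultdict

# Weights copied from the card's formula; everything else here is illustrative.
WEIGHTS = {"input": 0.15, "cot": 0.35, "answer": 0.5}


def weighted_total(scores):
    """Recompute 0.15*input + 0.35*cot + 0.5*answer (rounding is an assumption)."""
    return round(sum(WEIGHTS[key] * scores[key] for key in WEIGHTS), 2)


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def stratified_split(records, test_fraction=0.1, seed=0):
    """Random split done per category, so both parts keep a similar category mix."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    train, test = [], []
    for category in sorted(by_category):
        group = list(by_category[category])
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    # Hypothetical in-memory rows standing in for lines of the real .jsonl file.
    demo = [
        {
            "id": f"demo_{i}",
            "category": "gsm8k" if i % 2 else "logic",
            "scores": {"input": 10, "cot": 8, "answer": 9},
        }
        for i in range(20)
    ]
    print(weighted_total(demo[0]["scores"]))
    train_rows, test_rows = stratified_split(demo, test_fraction=0.2)
    print(len(train_rows), len(test_rows))
```

Splitting per category rather than over the whole pool is one way to keep the category balance the card asks for; a plain `random.shuffle` over all rows would also satisfy "random split" but can leave small categories missing from the held-out part.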


Provider: maas

Created: 2025-08-19
