
GPT-OSS-20B-Distilled-Reasoning-Mini

ModelScope Community · Updated 2026-01-06 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/AI-ModelScope/GPT-OSS-20B-Distilled-Reasoning-Mini
Description:
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/yuc6SULgg2DNRDJQrGB5l.jpeg)

# Dataset Card for GPT-OSS-20B Distilled Reasoning Dataset Mini

*(Multi-stage Evaluative Refinement Method for Reasoning Generation)*

## Dataset Details and Description

This is a high-quality instruction fine-tuning dataset constructed through knowledge distillation, featuring detailed Chain-of-Thought (CoT) reasoning processes. The dataset is designed to enhance the capabilities of smaller language models in complex reasoning, logical analysis, and instruction following.

I generated this dataset through a three-stage process:

- **Seed Question Aggregation:** I gathered and filtered seed questions from multiple high-quality data sources, including self-generated ones.
- **Knowledge Distillation & Cleaning:** I used `openai/gpt-oss-20b` (High) as the teacher model to generate in-depth reasoning and responses for the seed questions, which then underwent an automated cleaning process.
- **LLM-as-a-Judge Quality Assessment:** I employed `Qwen/Qwen3-235B-A22B-Instruct-2507` as the judge model to score the generated data across multiple dimensions.

- **Curated by:** Jack Rong
- **Funded by:** Jack Rong
- **Shared by:** Jack Rong
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Teacher Model**: `unsloth/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`

#### Source Data

<div style="display: flex; gap: 10px; justify-content: space-around;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/hEWXvzYitkXka19sBs9k_.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/61KVV5AJeDrm5FpMrxZzC.png" style="width: calc(33.33% - 10px);">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/YQ49_Kdu8kPY9VfWnAD0_.png" style="width: calc(33.33% - 10px);">
</div>

### Phase 1: Seed Question Aggregation

To ensure diversity and comprehensiveness in the distilled instruction sources, I employed a **multi-source aggregation strategy** to construct the seed question pool:

<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/bm5_WgDdPTFnWFgGSBGTs.png" width="300">

- **Self-generated** (300 entries): Using the **unsloth/Qwen3-4B-Instruct-2507** model, I generated **300 representative initial questions** based on 7 predefined core thinking templates:
  - Logical reasoning
  - Basic mathematics
  - Scientific knowledge
  - Programming thinking
  - Common sense
  - Creative imagination
  - Ethical dilemmas
- **External Dataset Integration**: To further enhance coverage, I sampled from multiple **established public datasets**, including:
  - `facebook/natural_reasoning`
  - `vicgalle/alpaca-gpt4`
  - `openai/gsm8k`
  - `fka/awesome-chatgpt-prompts`
  - `microsoft/rStar-Coder`
  - `nvidia/OpenScience`

Prior to distillation, all seed questions underwent automated cleaning to filter out entries with:

- Missing referenced attachments
- Non-English content
- Excessively short queries

This process left **1,990 questions** for final distillation (from an initial pool of 1,994 entries).

### Phase 2: Distillation & Cleaning

<div style="display: flex; gap: 10px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/P4LYLG5K6FNJ-xrHkFO0X.png" alt="Image 1" width="300">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/0kgR7OMbtEF7SQJ1Z-T7w.png" alt="Image 2" width="300">
</div>

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/knsldQhMFoy4eyF-X3qi-.png)

This phase forms the core of generating high-quality chain-of-thought (CoT) content. In this workflow **both inputs and outputs undergo quality verification**: the inputs are cleaned, and the outputs are systematically validated as well.

- **Distillation Generation**: I employed `openai/gpt-oss-20b` (High) as the teacher model. For each seed question, a specially designed prompt template explicitly instructed the model to:
  1. Conduct detailed **"Think step-by-step" chain-of-thought reasoning**, and then
  2. Deliver the final answer.

  This structure places the full reasoning before the conclusion.
- **Post-Distillation Cleaning**: Raw CoT outputs and answers from the teacher model were **not directly adopted**. Instead, I ran an additional quality assessment and cleaning round to:
  - Eliminate low-quality outputs caused by **model hallucinations**, **logical inconsistencies**, or **formatting errors**
  - Filter entries failing factual coherence or structural integrity checks

  This step is intended to ensure that **every sample** in the final dataset contains a **high-quality, reliable reasoning process and answer**.

### Phase 3: LLM-as-a-Judge Quality Scoring

To quantitatively evaluate distilled data quality, I implemented an **LLM-as-a-Judge** assessment framework that measures multiple reasoning dimensions.

- **Judge Model**: I used the more powerful `Qwen/Qwen3-235B-A22B-Instruct-2507` model as the judge.
- **Multi-Dimensional Evaluation Rubric**: A **three-aspect scoring system** requires the judge model to rate each entry on a 1-10 scale:
  - **Input Score**: Measures question clarity, answerability, reasoning density, and contextual relevance
  - **CoT Score**: Evaluates logical coherence, presence of **logical leaps**, constraint adherence, and alignment with the final answer
  - **Answer Score**: Assesses accuracy, completeness, **format adherence**, and robustness
- **Weighted Scoring Formula**: To emphasize reasoning and answer quality, I applied a weighted average:
  `Total Score = 0.15 × Input Score + 0.35 × CoT Score + 0.5 × Answer Score`
- **Scoring Results**: Of the 1,990 distilled entries, **1,946** were scored (44 failed samples were excluded). Most total scores fall in the 8.0-10.0 range, which supports the effectiveness of the three-stage distillation process.

<div style="display: flex; gap: 12px; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/-5qm3NXBKM2q_Twg2DfHy.png" alt="Figure 1" width="450" style="object-fit: contain;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/W7G5ua95zMGVTyaGxbq9N.png" alt="Figure 2" width="280" style="object-fit: contain;">
</div>

<img src="https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/_-c1UaAqYGn_2d81CPVtx.png" alt="800px image" width="650" style="display:block; margin:auto;">

### Direct Use

This dataset is primarily designed for instruction fine-tuning of language models, specifically to enhance their chain-of-thought (CoT) generation and complex reasoning capabilities. It applies to the following scenarios:

- **Model Fine-Tuning**: Serves as training data for fine-tuning language models of various scales to improve their logical reasoning, step-by-step problem decomposition, and instruction-following abilities in English contexts.
- **Academic Research**: Enables investigation into knowledge distillation efficacy, CoT generation mechanisms, and the validity of LLM-as-a-Judge evaluation methodologies.
- **Model Evaluation**: Functions as a benchmark to assess model performance on complex reasoning tasks.

### Out-of-Scope Use

- **Critical Fact-Checking in Production**: Because it was generated by AI models, this dataset may contain factual inaccuracies. It **shall not** be used in production environments requiring high factual precision (e.g., medical, legal, or financial advice) without rigorous fact verification.
- **Non-English Tasks**: Content is exclusively in English and **is not suitable** for direct fine-tuning of Chinese or other language models.
- **Malicious Applications**: **Strictly prohibited** for any harmful purposes, including but not limited to generating toxic, unethical, biased, or misleading content.

### Dataset Structure

The dataset is provided in `.jsonl` format, where each line is a JSON object. No predefined train/validation/test splits are included; users are advised to perform random splits according to their specific requirements.

### Data Fields

- **id (string)**: Unique identifier for the sample.
- **category (string)**: Category to which the question belongs.
- **input (string)**: User's instruction or query.
- **CoT_content (string)**: Chain-of-thought reasoning process generated by the teacher model.
- **answer_content (string)**: Final answer generated by the teacher model.
- **judge_reviews (list)**: List containing review comments from judge models.
- **scores (dict)**: Dictionary containing specific scores for the `input`, `cot`, and `answer` dimensions.
- **total_score (float)**: Final quality score calculated using the weighted formula.
- **num_judgments (int)**: Number of judgments (defaulting to 1).

### Data Instance

```json
{"id": "hf_openai_gsm8k_00173", "category": "gsm8k", "input": "Mitchell has 30 pencils. He has 6 more pencils than Antonio. How many pencils does Mitchell and Antonio have together?", "CoT_content": "<|channel|>analysis<|message|>We need to parse the problem: Mitchell has 30 pencils. He has 6 more pencils than Antonio. So Antonio's pencils = Mitchell - 6 = 30 - 6 = 24. Then together they have 30 + 24 = 54 pencils.\n\nCheck: If Mitchell has 30 and Antonio has 24, difference is indeed 6. So answer 54.\n\nThus final answer: 54 pencils.\n\nWe need to produce answer with \"Final answer:\" on a new line.", "answer_content": "To find the total number of pencils:\n\n1. **Determine Antonio's pencils** \n Mitchell has 6 more than Antonio. \n \\[\n \\text{Antonio} = \\text{Mitchell} - 6 = 30 - 6 = 24\n \\]\n\n2. **Add their pencils together** \n \\[\n \\text{Total} = \\text{Mitchell} + \\text{Antonio} = 30 + 24 = 54\n \\]\n\nSo, Mitchell and Antonio have a total of **54 pencils**.\n\nFinal answer: 54", "judge_reviews": [{"id": "hf_openai_gsm8k_00173", "input_review": "The input is clear, directly stated, and requires basic arithmetic reasoning with a well-defined solution path.", "cot_review": "The chain-of-thought is logically coherent, correctly interprets the relationship between Mitchell and Antonio's pencils, and verifies the result.", "answer_review": "The final answer is correct, clearly formatted, includes proper justification, and ends with 'Final answer: 54' as required.", "scores": {"input": 10, "cot": 10, "answer": 10}}], "scores": {"input": 10, "cot": 10, "answer": 10}, "total_score": 10.0, "num_judgments": 1}
```

### Data Splits

This dataset **does not include predefined train/validation/test splits**. Users are **advised to perform random splits** according to their specific requirements to:

- Prevent data leakage during evaluation
- Maintain category distribution balance (critical for reasoning tasks)
- Align with downstream task constraints

# Acknowledgments

The creation of this dataset was made possible by the following open-source models and datasets, to whose authors I extend my sincere gratitude:

- **Teacher Model**: `openai/gpt-oss-20b`
- **Judge Model**: `Qwen/Qwen3-235B-A22B-Instruct-2507`
- **Seed Question Sources**: `facebook/natural_reasoning`, `vicgalle/alpaca-gpt4`, `openai/gsm8k`, `fka/awesome-chatgpt-prompts`, `microsoft/rStar-Coder`, `nvidia/OpenScience`

```bibtex
@dataset{jackrong2025gptossdistilled,
  author = {JACKRONG},
  title = {Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini},
  month = aug,
  year = {2025},
  publisher = {Hugging Face},
  version = {1.0.0},
  url = {https://huggingface.co/datasets/Jackrong/GPT-OSS-20B-Distilled-Reasoning-Mini}
}
```
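
The weighted scoring formula from Phase 3 and the split advice under Data Splits can be illustrated with a small sketch. It is not the author's pipeline: the function names (`total_score`, `load_jsonl`, `stratified_split`), the 10% test fraction, and the fixed seed are assumptions made for illustration. Only the weights and the field names (`category`, `scores`) come from the card.

```python
import json
import random
from collections import defaultdict

# Weights copied from the card's formula:
# Total Score = 0.15 * Input + 0.35 * CoT + 0.5 * Answer
WEIGHTS = {"input": 0.15, "cot": 0.35, "answer": 0.5}


def total_score(scores):
    """Weighted average of the three judge scores (each on a 1-10 scale)."""
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


def load_jsonl(path):
    """Read one JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def stratified_split(records, test_fraction=0.1, seed=0):
    """Random split that keeps each category's share similar in both parts.

    test_fraction and seed are illustrative defaults, not values from the card.
    """
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):  # sorted for reproducibility
        group = by_category[category]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Recompute the total for the scores shown in the Data Instance above.
example_scores = {"input": 10, "cot": 10, "answer": 10}
print(total_score(example_scores))
```

Splitting per category rather than over the whole file is one way to follow the card's advice about category balance; a plain `random.shuffle` over all records would also be a valid random split but may leave small categories under-represented in the test part.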

Provided by: maas
Created: 2025-08-19