gpt-oss-120b-reasoning-STEM-5K

Name: gpt-oss-120b-reasoning-STEM-5K
Creator: maas
Published: 2026-01-08 17:18:57
License: No description available

ModelScope community · Updated 2026-01-08 · Indexed 2025-08-23

Download link:

https://modelscope.cn/datasets/AI-ModelScope/gpt-oss-120b-reasoning-STEM-5K


Resource description:

# GPT-OSS-120B-Distilled-Reasoning-STEM Dataset

## 1) Dataset Overview

* **Data Source Model**: `gpt-oss-120b-high`
* **Task Type**: **STEM Reasoning and Problem Solving** (Science, Technology, Engineering & Mathematics)
* **Data Format**: `JSON Lines`
* **Fields**: `generator`, `category`, `input`, `CoT_Native——reasoning`, `answer` (Consistent with the math dataset: the original `output` is split into `reasoning` and `answer` for CoT/SFT scenarios.)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/EW5fKpnazIIw8SEd6hnaW.png)

## 2) Design Goals (Motivation)

This dataset targets **comprehensive STEM reasoning**: concept understanding, **multi-step deduction**, and **formula/theorem application** across **Mathematics, Physics, Chemistry, Computer Science, Engineering, and Life Sciences**. Unlike datasets that provide only the final answer, it keeps the **reasoning chain** (`reasoning`) separate from the **answer** (`answer`), which facilitates:

* Training/fine-tuning models with **reasoning capabilities** (**CoT/SFT**).
* Evaluating on two dimensions: **correct process** and **correct answer**.
* **Flexible extraction** for different downstream tasks (QA only, reasoning only, full dataset).

For the transparency and responsible-use information that a data card should include, I follow industry best practices such as **Dataset Cards / Datasheets for Datasets / Dataset Nutrition Label**.

## 3) Data Source & Distillation

* **Generation Model**: **gpt-oss-120b** (high reasoning configuration), generating complete reasoning processes and final answers.
* **Prompt Coverage**: **Multiple-choice questions, short-answer questions, calculations/proofs, concept questions**, etc., across **STEM** fields.
* **Cleaning and Structuring**:
  * Splitting the model's full output into `reasoning` and `answer` segments, preserving the "**native reasoning text**";
  * Filtering out **empty answers, truncated/overflowing outputs**, and **clearly incomplete samples**;
  * **Deduplication**, primarily based on the `input` prompt;
  * **Standardizing fields and labels** to ensure **downstream usability**.

## 4) Schema & Format

**One sample per line (JSONL)**:

```json
{
  "Generator": "gpt-oss-120b",
  "Category": "stem",
  "Input": "Prompt/Question Text",
  "CoT_Native——reasoning": "Complete multi-step thinking process (native reasoning chain)",
  "Answer": "Final answer text (decoupled from reasoning)"
}
```

## 5) Quality & Known Limitations

* **Generative Bias**: All samples originate from the same large model family, which may introduce **stylistic and knowledge biases**; mixing with **other sources** during training is recommended to improve **robustness**.
* **Reasoning Chain Noise**: Despite cleaning, individual samples may still show **verbosity/repetition** or **slight inconsistencies** with the answer; a **secondary audit** before training, or **heuristic rules** for further segmentation and annotation, is recommended.
* **Subject Coverage**: The distribution across STEM sub-fields is **not perfectly balanced**; **long-tail subjects** (e.g., Materials Science, Fluid Dynamics) have **relatively few** samples.
* **Responsible Use**: The dataset should **not be used for exam cheating** or for generating **misleading scientific conclusions**; **high-risk scenarios** in Medicine, Chemistry, Engineering, etc. require **expert review** or **human verification** steps.

## 6) License & Intended Use

* **License**: **CC-BY-4.0**
* **Primary Use**: **CoT/SFT training**, **reasoning evaluation**, **data analysis**, and **visualization teaching** in **academic/industrial scenarios**.
* **Out-of-Scope**: Using reasoning chains **without human review** for **high-risk decisions** such as **clinical diagnosis** or **engineering safety reviews** is **not recommended**.

## 7) Acknowledgements

* The construction of this dataset is based on the generation capabilities of gpt-oss-120b and the optimized design of STEM reasoning templates.
* **Seed Questions**: Derived in part from nvidia/Nemotron-Post-Training-Dataset-v1.
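
As an illustration of the schema in section 4 and the cleaning steps in section 3, the following is a minimal Python sketch, not the author's actual pipeline. It parses JSONL lines, drops samples with empty answers, deduplicates on the `input` prompt, and extracts QA-only, reasoning-only, or full views. The function names (`normalize_record`, `clean_records`, `extract`) are hypothetical. Because the card lists lowercase field names but shows capitalized keys in its schema example, the sketch assumes keys can be matched case-insensitively and finds the reasoning field by its `reasoning` suffix. Detecting truncated or overflowing outputs is not attempted here, since the card does not say how that was done.

```python
import json


def normalize_record(raw):
    """Map a raw record onto lowercase field names.

    Assumption: key casing may vary between files, so keys are compared
    case-insensitively, and the reasoning field is located by suffix.
    """
    lowered = {key.lower(): value for key, value in raw.items()}
    reasoning = next(
        (value for key, value in lowered.items() if key.endswith("reasoning")),
        "",
    )
    return {
        "generator": lowered.get("generator") or "",
        "category": lowered.get("category") or "",
        "input": lowered.get("input") or "",
        "reasoning": reasoning or "",
        "answer": lowered.get("answer") or "",
    }


def clean_records(lines):
    """Parse JSONL lines, drop empty answers, deduplicate on `input`."""
    seen_inputs = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = normalize_record(json.loads(line))
        if not record["answer"].strip():
            continue  # empty answer: filtered out
        prompt = record["input"].strip()
        if prompt in seen_inputs:
            continue  # duplicate prompt: keep the first occurrence only
        seen_inputs.add(prompt)
        cleaned.append(record)
    return cleaned


def extract(record, mode="full"):
    """Return a QA-only, reasoning-only, or full view of one record."""
    if mode == "qa":
        return {"input": record["input"], "answer": record["answer"]}
    if mode == "reasoning":
        return {"input": record["input"], "reasoning": record["reasoning"]}
    if mode == "full":
        return dict(record)
    raise ValueError(f"unknown mode: {mode!r}")
```

Since an open text file iterates line by line, a downloaded file can be passed directly, for example `clean_records(open(path, encoding="utf-8"))`, where `path` is whatever local filename the download produced.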


Provider:

maas

Created:

2025-08-19
