gpt-oss-120b-reasoning-STEM-5K
Source: ModelScope community · Updated 2026-01-08 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/AI-ModelScope/gpt-oss-120b-reasoning-STEM-5K
Description:
# GPT-OSS-120B-Distilled-Reasoning-STEM Dataset
## 1) Dataset Overview
* **Data Source Model**: `gpt-oss-120b-high`
* **Task Type**: **STEM Reasoning and Problem Solving** (Science, Technology, Engineering & Mathematics)
* **Data Format**: `JSON Lines` (JSONL)
* **Fields**: `generator`, `category`, `input`, `CoT_Native——reasoning`, `answer`
(Consistent with the math dataset: the original `output` field is split into `reasoning` and `answer` for CoT/SFT scenarios.)

## 2) Design Goals (Motivation)
This dataset targets **comprehensive STEM reasoning**: concept understanding, **multi-step deduction**, and **formula/theorem application** across **Mathematics, Physics, Chemistry, Computer Science, Engineering, and Life Sciences**. Unlike datasets that provide only the final answer, it keeps the **reasoning chain** (`reasoning`) separate from the **answer** (`answer`), which facilitates:
* Training/fine-tuning models with **reasoning capabilities** (**CoT/SFT**).
* Evaluating along **two dimensions**: whether the **reasoning process** is correct and whether the **final answer** is correct.
* **Flexible extraction** for different downstream tasks (QA only, reasoning only, full dataset).
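
The three extraction modes listed above (QA only, reasoning only, full) can be sketched as a small projection function. This is only an illustration: the `prompt`/`completion` output layout, the `mode` names, and the blank-line separator between reasoning and answer are assumptions, not something the dataset prescribes. The default key names are copied from the sample record in the Schema section.

```python
def to_view(record: dict, mode: str = "full",
            input_key: str = "Input",
            reasoning_key: str = "CoT_Native——reasoning",
            answer_key: str = "Answer") -> dict:
    """Project one record onto a prompt/completion pair (hypothetical layout)."""
    reasoning = record[reasoning_key]
    answer = record[answer_key]
    if mode == "qa":
        completion = answer                        # question -> final answer only
    elif mode == "reasoning":
        completion = reasoning                     # question -> reasoning chain only
    elif mode == "full":
        completion = f"{reasoning}\n\n{answer}"    # reasoning followed by the answer
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"prompt": record[input_key], "completion": completion}
```

Keeping the key names as parameters makes it easy to adapt the sketch if the actual files use a different spelling or casing.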
For the transparency and responsible-use information that a data card should include, I follow industry best practices such as **Dataset Cards**, **Datasheets for Datasets**, and the **Dataset Nutrition Label**.
## 3) Data Source & Distillation
* **Generation Model**: **gpt-oss-120b** (high reasoning configuration), generating complete reasoning processes and final answers.
* **Prompt Coverage**: **Multiple-choice questions, short-answer questions, calculation/proofs, concept questions**, etc., across **STEM** fields.
* **Cleaning and Structuring**:
  * Splitting the model's full output into `reasoning` and `answer` segments, preserving "**native reasoning text**";
  * Filtering out **empty answers, truncated/overflowing outputs**, and **clearly incomplete samples**;
  * **Deduplication** primarily based on the `input` prompt;
  * **Standardizing fields and labels** to ensure **downstream usability**.
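
Two of the cleaning steps above, dropping samples with empty fields and deduplicating on the prompt, might look roughly like the sketch below. It is not the pipeline actually used to build the dataset: the whitespace- and case-insensitive dedup key is an assumption, and detecting truncated outputs or splitting raw model output is not shown because the card does not describe how that was done. Key names are taken from the sample record in the Schema section.

```python
def clean(records, input_key="Input",
          reasoning_key="CoT_Native——reasoning", answer_key="Answer"):
    """Drop records with empty fields and keep the first record per prompt."""
    seen = set()
    kept = []
    for rec in records:
        answer = (rec.get(answer_key) or "").strip()
        reasoning = (rec.get(reasoning_key) or "").strip()
        if not answer or not reasoning:
            continue  # empty answer or empty reasoning chain
        # Assumed dedup key: prompt with whitespace collapsed and case folded.
        key = " ".join((rec.get(input_key) or "").split()).lower()
        if not key or key in seen:
            continue  # empty prompt or duplicate of an earlier prompt
        seen.add(key)
        kept.append(rec)
    return kept
```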
## 4) Schema & Format
**One sample per line (JSONL)**:
```json
{
  "Generator": "gpt-oss-120b",
  "Category": "stem",
  "Input": "Prompt/Question Text",
  "CoT_Native——reasoning": "Complete multi-step thinking process (native reasoning chain)",
  "Answer": "Final answer text (decoupled from reasoning)"
}
```
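
As a minimal sketch of consuming this format, the reader below parses one JSON object per line and checks that the expected keys are present. The key names are copied verbatim from the sample record above; the Fields list in the overview spells them in lowercase, so treat the exact casing as an assumption and adjust `EXPECTED_KEYS` to match the actual files.

```python
import json
from typing import Iterable, Iterator

# Copied from the sample record; the casing in the real files may differ.
EXPECTED_KEYS = ("Generator", "Category", "Input", "CoT_Native——reasoning", "Answer")

def iter_records(lines: Iterable[str], expected_keys=EXPECTED_KEYS) -> Iterator[dict]:
    """Yield one dict per non-blank JSONL line, raising if a key is missing."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [k for k in expected_keys if k not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        yield record
```

Because the function accepts any iterable of strings, it works on an open file handle, for example `with open("train.jsonl", encoding="utf-8") as f: records = list(iter_records(f))`, where `train.jsonl` is a hypothetical local file name.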
## 5) Quality & Known Limitations
* **Generative Bias**: Samples originate from the same large model family, potentially introducing **stylistic and knowledge biases**; it is recommended to **mix with other sources** for training to enhance **robustness**.
* **Reasoning Chain Noise**: Despite cleaning, individual samples may still exhibit **verbosity/repetition** or **slight inconsistencies** with the answer; it is recommended to perform **secondary audits** before training or use **heuristic rules** for further segmentation and annotation.
* **Subject Coverage**: The distribution across STEM sub-fields is **not perfectly balanced**; samples for **long-tail subjects** (e.g., Materials Science, Fluid Dynamics) are **relatively scarce**.
* **Responsible Use**: The dataset should **not be used for exam cheating** or for generating **misleading scientific conclusions**; in **high-risk scenarios** such as Medicine, Chemistry, or Engineering, it is crucial to include **expert review** or **human verification** steps.
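
One possible heuristic rule for the verbosity/repetition noise mentioned above is to measure how many sentences in a reasoning chain repeat an earlier one and to send high-scoring samples to a secondary audit. The sketch below is an illustrative assumption, not part of the dataset's tooling. The sentence splitter is deliberately crude (it also splits on decimal points), so the score is only a rough signal, and any audit threshold would need to be tuned on real samples.

```python
import re

def repeated_sentence_ratio(text: str) -> float:
    """Fraction of sentences that duplicate an earlier sentence (0.0 if none)."""
    # Crude split on sentence punctuation and newlines; case-insensitive compare.
    sentences = [s.strip().lower() for s in re.split(r"[.!?\n]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return 1.0 - len(set(sentences)) / len(sentences)
```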
## 6) License & Intended Use
* **License**: **CC-BY-4.0**
* **Primary Use**: **CoT/SFT training**, **reasoning evaluation**, **data analysis**, and **visualization teaching** in **academic/industrial scenarios**.
* **Out-of-Scope**: Using reasoning chains **without human review** for **high-risk decisions** such as **clinical diagnosis** or **engineering safety reviews** is **not recommended**.
## 7) Acknowledgements
* The construction of this dataset is based on the generation capabilities of gpt-oss-120b and the optimized design of STEM reasoning templates.
* **Seed Questions**: Derived in part from nvidia/Nemotron-Post-Training-Dataset-v1.
Provider:
maas
Created:
2025-08-19



