
gpt-oss-120b-reasoning-STEM-5K

ModelScope community · Updated 2026-01-08 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/AI-ModelScope/gpt-oss-120b-reasoning-STEM-5K
Resource description:
# GPT-OSS-120B-Distilled-Reasoning-STEM Dataset

## 1) Dataset Overview

* **Data Source Model**: `gpt-oss-120b-high`
* **Task Type**: **STEM Reasoning and Problem Solving** (Science, Technology, Engineering & Mathematics)
* **Data Format**: `JSON Lines`
* **Fields**: `generator`, `category`, `input`, `CoT_Native——reasoning`, `answer` (consistent with the math dataset: the original `output` is split into `reasoning` and `answer` for CoT/SFT scenarios)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/EW5fKpnazIIw8SEd6hnaW.png)

## 2) Design Goals (Motivation)

This dataset targets **comprehensive STEM reasoning**: concept understanding, **multi-step deduction**, and **formula/theorem application** across **Mathematics, Physics, Chemistry, Computer Science, Engineering, and Life Sciences**. Unlike datasets that provide only the final answer, it clearly separates the **reasoning chain** (`reasoning`) from the **answer** (`answer`), which facilitates:

* Training/fine-tuning models with **reasoning capabilities** (**CoT/SFT**).
* Evaluating **dual-dimensional metrics**: **correct process** and **correct answer**.
* **Flexible extraction** for different downstream tasks (QA only, reasoning only, full dataset).

For the transparency and responsible-use information that a data card should include, I follow industry best practices such as **Dataset Cards / Datasheets for Datasets / Dataset Nutrition Label**.

## 3) Data Source & Distillation

* **Generation Model**: **gpt-oss-120b** (high reasoning configuration), generating complete reasoning processes and final answers.
* **Prompt Coverage**: **Multiple-choice questions, short-answer questions, calculations/proofs, concept questions**, etc., across **STEM** fields.
* **Cleaning and Structuring**:
  * Splitting the model's full output into `reasoning` and `answer` segments, preserving the "**native reasoning text**";
  * Filtering out **empty answers, truncated/overflowing outputs**, and **clearly incomplete samples**;
  * **Deduplication** primarily based on the `input` prompt;
  * **Standardizing fields and labels** to ensure **downstream usability**.

## 4) Schema & Format

**One sample per line (JSONL)**:

```json
{
  "Generator": "gpt-oss-120b",
  "Category": "stem",
  "Input": "Prompt/Question Text",
  "CoT_Native——reasoning": "Complete multi-step thinking process (native reasoning chain)",
  "Answer": "Final answer text (decoupled from reasoning)"
}
```

## 5) Quality & Known Limitations

* **Generative Bias**: Samples originate from the same large model family, potentially introducing **stylistic and knowledge biases**; it is recommended to **mix with other sources** for training to enhance **robustness**.
* **Reasoning Chain Noise**: Despite cleaning, individual samples may still exhibit **verbosity/repetition** or **slight inconsistencies** with the answer; it is recommended to perform **secondary audits** before training or use **heuristic rules** for further segmentation and annotation.
* **Subject Coverage**: The distribution across STEM sub-fields is **not perfectly balanced**; **long-tail subjects** (e.g., Materials Science, Fluid Dynamics) have **relatively fewer** samples.
* **Responsible Use**: Should **not be used for exam cheating** or for generating **misleading scientific conclusions**; for **high-risk scenarios** involving Medicine, Chemistry, Engineering, etc., it is crucial to include **expert review** or **human verification** steps.

## 6) License & Intended Use

* **License**: **CC-BY-4.0**
* **Primary Use**: **CoT/SFT training**, **reasoning evaluation**, **data analysis**, and **visualization teaching** in **academic/industrial scenarios**.
* **Out-of-Scope**: It is **not recommended** to use reasoning chains directly, **without human review**, for **high-risk decisions** such as **clinical diagnosis** or **engineering safety reviews**.

## 7) Acknowledgements

* The construction of this dataset is based on the generation capabilities of gpt-oss-120b and the optimized design of STEM reasoning templates.
* **Seed Questions**: Derived in part from nvidia/Nemotron-Post-Training-Dataset-v1.
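
The schema in section 4 and the "flexible extraction" goal in section 2 can be illustrated with a small sketch. It is not an official loader: `read_jsonl` and `make_view` are hypothetical helper names, and the sample record is made up. Because the card lists lower-case field names in the overview but capitalized keys in the schema example, the sketch lower-cases keys on load; that normalization is an assumption, not something the card specifies.

```python
import io
import json

# Reasoning field name as printed in the card, lower-cased to match the
# key normalization below (the normalization itself is an assumption).
REASONING_KEY = "cot_native——reasoning"


def read_jsonl(stream):
    """Yield one dict per non-empty line, with lower-cased keys."""
    for line in stream:
        line = line.strip()
        if line:
            record = json.loads(line)
            yield {key.lower(): value for key, value in record.items()}


def make_view(record, mode="full"):
    """Project a record onto the fields needed by one downstream task."""
    if mode == "qa":
        return {"input": record["input"], "answer": record["answer"]}
    if mode == "reasoning":
        return {"input": record["input"], "reasoning": record[REASONING_KEY]}
    if mode == "full":
        return {
            "input": record["input"],
            "reasoning": record[REASONING_KEY],
            "answer": record["answer"],
        }
    raise ValueError(f"unknown mode: {mode}")


# Made-up sample line standing in for the real JSONL file.
sample = io.StringIO(
    json.dumps(
        {
            "Generator": "gpt-oss-120b",
            "Category": "stem",
            "Input": "What is 2 + 3?",
            "CoT_Native——reasoning": "Add the two integers: 2 + 3 = 5.",
            "Answer": "5",
        },
        ensure_ascii=False,
    )
    + "\n"
)
records = list(read_jsonl(sample))
qa_view = make_view(records[0], "qa")
```

To read a downloaded file instead of the in-memory sample, pass an open file handle (for example `open(path, encoding="utf-8")`) to `read_jsonl`; the actual file name inside the dataset repository is not stated in the card.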

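
Two of the cleaning steps listed in section 3, filtering empty answers and deduplicating on the `input` prompt, can be sketched as follows. This is only an illustration under stated assumptions, not the author's pipeline: `clean` is a hypothetical helper, records are assumed to use lower-case keys, and the dedup key (whitespace-collapsed, case-folded prompt) is a guess, since the card does not say how prompts are compared. Truncation detection is left out because the card gives no criterion for it.

```python
def clean(records):
    """Drop records with an empty answer, then keep the first record per prompt."""
    seen = set()
    kept = []
    for record in records:
        answer = (record.get("answer") or "").strip()
        if not answer:
            continue  # empty-answer filter
        # Hypothetical dedup key: whitespace-collapsed, case-folded prompt.
        key = " ".join(record.get("input", "").split()).casefold()
        if key in seen:
            continue  # duplicate prompt, keep only the first occurrence
        seen.add(key)
        kept.append(record)
    return kept


# Made-up rows for illustration only.
rows = [
    {"input": "State Ohm's law.", "answer": "V = IR"},
    {"input": "state  ohm's law.", "answer": "V = I * R"},  # duplicate prompt
    {"input": "Define entropy.", "answer": "   "},  # empty answer
]
cleaned = clean(rows)
```

Keeping the first occurrence is a design choice made here for simplicity; a real pipeline might instead keep the longest or highest-quality sample per prompt.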
Provider:
maas
Created:
2025-08-19