Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M

Name: Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M
Creator: Logics-MLLM
Published: 2026-01-19 02:18:04
License: No description available

Source: Hugging Face. Updated: 2026-01-19. Indexed: 2026-02-07.

Download link:

https://hf-mirror.com/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M

Description:

Logics-STEM-SFT-Dataset-2.2M is a curated long Chain-of-Thought (CoT) supervised fine-tuning (SFT) dataset for STEM reasoning. It is built on high-quality open-source data and refined through a rigorous curation and distillation data engine. It consists of prompt-response pairs distilled from Qwen3-235B-A22B-Thinking-2507, covering math and broader STEM domains (e.g., physics, chemistry, biology, engineering, and computer science). Because licensing constraints prevent some internal components from being fully open-sourced, Logics-STEM-SFT-Dataset-Open-1.6M is released as an alternative. The construction pipeline includes annotation, deduplication, decontamination, response distillation, and weighted stratified sampling, and the result is intended as a strong, general-purpose SFT baseline for training reasoning-capable LLMs.
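
The description names deduplication and weighted stratified sampling as pipeline stages but does not say how they are implemented. The following is a minimal illustrative sketch of what those two stages could look like, not the authors' actual method. The record fields (`prompt`, `domain`), the domain weights, and both function names are assumptions made up for this example, and the deduplication shown is exact-match only after simple normalization.

```python
import hashlib
import random
from collections import defaultdict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(prompt.lower().split())


def deduplicate(records):
    """Keep the first record per normalized prompt (hypothetical exact-match dedup)."""
    seen = set()
    unique = []
    for rec in records:
        key = hashlib.sha256(normalize(rec["prompt"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def weighted_stratified_sample(records, weights, total, seed=0):
    """Split a budget of `total` records across domains in proportion to `weights`."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for rec in records:
        by_domain[rec["domain"]].append(rec)
    weight_sum = sum(weights.get(d, 0.0) for d in by_domain)
    if weight_sum == 0:
        return []
    sample = []
    for domain in sorted(by_domain):
        items = by_domain[domain]
        quota = round(total * weights.get(domain, 0.0) / weight_sum)
        # A domain cannot contribute more records than it holds.
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample


# Toy records; the fields and weights are assumptions, not the dataset's schema.
records = [
    {"prompt": "Solve x + 1 = 2", "domain": "math"},
    {"prompt": "solve  x + 1 = 2", "domain": "math"},  # duplicate after normalization
    {"prompt": "Integrate x dx", "domain": "math"},
    {"prompt": "Differentiate x^2", "domain": "math"},
    {"prompt": "State Newton's second law", "domain": "physics"},
    {"prompt": "Define molarity", "domain": "chemistry"},
]
unique = deduplicate(records)
sample = weighted_stratified_sample(
    unique, {"math": 2.0, "physics": 1.0, "chemistry": 1.0}, total=4
)
```

In this toy run, `math` carries twice the weight of the other domains, so it receives half of the four-record budget. A real pipeline would likely also need near-duplicate detection and decontamination against evaluation sets, which this sketch does not attempt.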

Provided by:

Logics-MLLM
