Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M
Source: Hugging Face | Updated: 2026-01-19 | Indexed: 2026-02-07
Download link:
https://hf-mirror.com/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M
Description:
Logics-STEM-SFT-Dataset-2.2M is a curated long Chain-of-Thought (CoT) supervised fine-tuning (SFT) dataset for STEM reasoning. It is built on high-quality open-source data and refined through a rigorous curation and distillation data engine. The dataset consists of prompt-response pairs whose responses were distilled from Qwen3-235B-A22B-Thinking-2507, covering math and broader STEM domains (e.g., physics, chemistry, biology, engineering, and computer science). Because of licensing constraints, some internal components cannot be fully open-sourced, so Logics-STEM-SFT-Dataset-Open-1.6M is released as the open alternative. The construction pipeline comprises annotation, deduplication, decontamination, response distillation, and weighted stratified sampling, and the result is intended as a strong, general-purpose SFT baseline for training reasoning-capable LLMs.
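The description names deduplication and weighted stratified sampling as pipeline stages but does not say how they are implemented. The following is only a minimal sketch of what those two stages could look like, not the authors' method. It assumes hypothetical record fields `prompt` and `domain`, uses exact matching on normalized text as a stand-in for whatever deduplication the real pipeline uses, and takes illustrative per-domain weights that are not from the dataset.

```python
import random
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different prompts match."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Keep the first record for each normalized prompt (exact-match dedup)."""
    seen = set()
    unique = []
    for rec in records:
        key = normalize(rec["prompt"])  # "prompt" is an assumed field name
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def weighted_stratified_sample(records, weights, total, seed=0):
    """Draw about `total` records, splitting the budget across domains by weight."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec["domain"]].append(rec)  # "domain" is an assumed field name

    weight_sum = sum(weights.get(d, 0.0) for d in strata)
    if weight_sum <= 0:
        raise ValueError("at least one present domain needs a positive weight")

    sample = []
    for domain in sorted(strata):
        quota = round(total * weights.get(domain, 0.0) / weight_sum)
        quota = min(quota, len(strata[domain]))  # cannot draw more than exists
        sample.extend(rng.sample(strata[domain], quota))
    return sample


# Toy records, invented purely for illustration.
toy = [
    {"prompt": "Solve x + 1 = 2.", "domain": "math"},
    {"prompt": "solve  x + 1 = 2.", "domain": "math"},  # duplicate after normalization
    {"prompt": "Integrate x dx.", "domain": "math"},
    {"prompt": "Differentiate x^2.", "domain": "math"},
    {"prompt": "State Newton's second law.", "domain": "physics"},
    {"prompt": "Define molarity.", "domain": "chemistry"},
]

unique = deduplicate(toy)
subset = weighted_stratified_sample(
    unique, {"math": 0.5, "physics": 0.25, "chemistry": 0.25}, total=4
)
```

Sampling within each domain separately keeps small domains from being crowded out by large ones, which is the usual motivation for stratifying. A production pipeline would more likely use fuzzy or embedding-based matching for deduplication and decontamination than the exact matching shown here.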
Provider:
Logics-MLLM



