
Logics-STEM-SFT-Dataset-Open-1.6M

ModelScope community. Updated 2026-05-15; indexed 2026-05-17.
Download link:
https://modelscope.cn/datasets/Alibaba-DT/Logics-STEM-SFT-Dataset-Open-1.6M
Resource description:
# Logics-STEM-SFT-Dataset-2.2M

## 📰 News

- [2026.01.05]🔥 Release of our [Technical Report](https://arxiv.org/abs/2601.01562).
- [2026.01.05]🔥 Release of the first versions of [Logics-STEM-8B-SFT](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-SFT), [Logics-STEM-8B-RL](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-RL), and [Logics-STEM-SFT-Dataset-Open-1.6M](https://huggingface.co/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M).

---

# Overview

### What is this dataset?

**Logics-STEM-SFT-Dataset-2.2M** is a curated **long Chain-of-Thought (CoT) SFT dataset for STEM reasoning**, built on top of high-quality open-source data and enhanced through a rigorous curation and distillation data engine. It consists of prompt–response pairs distilled by **Qwen3-235B-A22B-Thinking-2507**, covering math and broader STEM domains (e.g., physics, chemistry, biology, engineering, and computer science). The dataset is designed to serve as a strong, general-purpose SFT baseline for training reasoning-capable LLMs.

Due to licensing constraints, some internal components (e.g., **DLR-Book**) cannot be fully open-sourced. We therefore release **Logics-STEM-SFT-Dataset-Open-1.6M**, derived from the 2.2M downsampled version described in the technical report, while preserving the same curation principles and data format.

### How is it curated?

![](./imgs/long_cot_data_engine.png)

We adopt a data curation engine with the following stages:

1. **Annotation**: validity/unambiguity filtering; discipline/domain; educational level; answer type; verifiable answer (when applicable)
2. **Deduplication**: exact + near-duplicate removal
3. **Decontamination**: removal of samples overlapping with evaluation benchmarks (MinHash + n-gram)
4. **Response distillation**: teacher-model long-CoT generation; repetition suppression; optional verification & regeneration
5. **Weighted stratified sampling**: uses **response length as a proxy for difficulty**, balancing hard reasoning density and broad coverage

## Experimental Results

![](./imgs/math_evaluation_res.png)

![](./imgs/stem_evaluation_res.png)

## Quickstart

### Load with `datasets`

```python
from datasets import load_dataset

# Repo id of the open release linked in the News section above;
# the full 2.2M version is not fully open-sourced.
ds = load_dataset("Logics-MLLM/Logics-STEM-SFT-Dataset-Open-1.6M")
print(ds)
print(ds["train"][0])
```

## Data Collection

We collect questions from the following publicly available datasets:

- **NuminaMath-1.5** (AI-MO/NuminaMath-1.5)
- **OpenThoughts3** (OpenThoughts)
- **Mixture-of-Thoughts / Open-R1** (Hugging Face Open-R1 related data recipe)
- **AceReason-1.1-SFT** (AceReason-Nemotron 1.1 SFT data)
- **AceReason-Math**
- **OpenScienceReasoning-2** (nvidia/OpenScienceReasoning-2)
- **OpenMathReasoning** (OpenMathReasoning; AIMO-2 winning solution dataset)
- **Llama-Nemotron-Post-Training-Dataset** (nvidia Llama-Nemotron post-training dataset)
- **DLR-Web** (DESIGNER: web-derived synthetic set)
- **DLR-Book** (DESIGNER: book-derived synthetic set)
- **Skywork-OR1-RL-Data** (Skywork Open Reasoner RL data)
- **NaturalReasoning** (NaturalReasoning)
- **DeepMath-103K** (DeepMath-103K)
- **DAPO-Math-17K** (DAPO Math 17K)
- **TheoremQA** (TheoremQA)
- **JEEBench** (JEEBench)
- **GPQA-Main** (GPQA)
- **GSM8K** (GSM8K)
- **AIME** (di-zhang-fdu/AIME_1983_2024)
- **AMC** (kaggle-aimo/amc_filtered)
- **s1-teasers** (s1 dataset)
- **s1-probs** (s1 dataset)
- **openaimath** (s1 / OpenAI-math style dataset referenced by s1)

*Acknowledgement: We sincerely thank the creators and maintainers of these open-source datasets for making this work possible.*

## Citation

If you use this dataset, please cite our technical report:

```bibtex
@misc{xu2026logicsstemempoweringllmreasoning,
  title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
  author={Mingyu Xu and Cheng Fang and Keyue Jiang and Yuqian Zheng and Yanghua Xiao and Baojian Zhou and Qifang Zhao and Suhang Zheng and Xiuwen Zhu and Jiyang Tang and Yongchi Zhao and Yijia Luo and Zhiqi Bai and Yuchi Xu and Wenbo Su and Wei Wang and Bing Zhao and Lin Qu and Xiaoxiao Xu},
  year={2026},
  eprint={2601.01562},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01562},
}
```
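The decontamination stage of the curation engine is described only as "MinHash + n-gram". The sketch below shows one way such a filter could look; it is not the authors' implementation. The function names, the word-level 3-gram shingles, the 64 hash functions, and the 0.5 similarity threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(shingles, num_perm=64):
    """Keep the minimum hash per seeded hash function; empty input gives []."""
    if not shingles:
        return []
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def decontaminate(samples, benchmark_questions, n=3, threshold=0.5):
    """Drop samples whose n-gram MinHash similarity to any benchmark item reaches the threshold."""
    bench_sigs = [minhash_signature(word_ngrams(q, n)) for q in benchmark_questions]
    kept = []
    for sample in samples:
        sig = minhash_signature(word_ngrams(sample, n))
        if any(estimated_jaccard(sig, b) >= threshold for b in bench_sigs):
            continue  # likely overlaps with an evaluation benchmark
        kept.append(sample)
    return kept
```

Calling `decontaminate(train_questions, benchmark_questions)` returns the questions that do not closely match any benchmark item. At the scale of millions of samples, comparing every pair this way would be too slow, and an index such as locality-sensitive hashing would normally replace the inner loop.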
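The weighted stratified sampling stage uses response length as a proxy for difficulty, but the description above gives no strata boundaries or weights. The following is a minimal sketch under stated assumptions: records are dicts with a `response` string, strata are equal-sized length buckets, and the default weights grow linearly with the bucket index. These choices and the function name are hypothetical, not taken from the technical report.

```python
import random


def weighted_stratified_sample(records, budget, num_strata=4, weights=None, seed=0):
    """Sample about `budget` records, favouring strata with longer responses."""
    if not records or budget <= 0:
        return []
    rng = random.Random(seed)

    # Sort by response length (the difficulty proxy) and cut into equal-sized strata.
    ordered = sorted(records, key=lambda r: len(r["response"]))
    num_strata = min(num_strata, len(ordered))
    bounds = [round(i * len(ordered) / num_strata) for i in range(num_strata + 1)]
    strata = [ordered[bounds[i]:bounds[i + 1]] for i in range(num_strata)]

    # Assumed default: weight 1 for the shortest stratum, up to num_strata for the longest.
    if weights is None:
        weights = [i + 1 for i in range(num_strata)]
    total = sum(weights)

    picked = []
    for stratum, weight in zip(strata, weights):
        # Every stratum with a positive weight keeps some coverage; quotas are
        # rounded, so the final count can differ slightly from `budget`.
        quota = min(len(stratum), round(budget * weight / total))
        picked.extend(rng.sample(stratum, quota))
    return picked
```

With 40 records, a budget of 10, and the default weights, the four strata contribute 1, 2, 3, and 4 records respectively, so long responses dominate while short ones are still represented.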

Provider:
maas
Created:
2026-01-19