
Logics-STEM-SFT-Dataset-2.2M

ModelScope community · Updated 2026-01-09 · Indexed 2026-01-10
Download link:
https://modelscope.cn/datasets/Alibaba-DT/Logics-STEM-SFT-Dataset-2.2M
Description:
# Logics-STEM-SFT-Dataset-2.2M

## 📰 News

- [2026.01.05]🔥 Released our [Technical Report](https://arxiv.org/abs/2601.01562).
- [2026.01.05]🔥 Released the first versions of [Logics-STEM-8B-SFT](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-SFT), [Logics-STEM-8B-RL](https://huggingface.co/Logics-MLLM/Logics-STEM-8B-RL), and [Logics-STEM-SFT-Dataset-2.2M](https://huggingface.co/datasets/Logics-MLLM/Logics-STEM-SFT-Dataset-2.2M).

---

# Overview

### What is this dataset?

**Logics-STEM-SFT-Dataset-2.2M** is a curated **long Chain-of-Thought (CoT) SFT dataset for STEM reasoning**, built on top of high-quality open-source data and enhanced through a rigorous curation and distillation data engine. It consists of prompt–response pairs distilled by **Qwen3-235B-A22B-Thinking-2507**, covering Math and broader STEM domains (e.g., physics, chemistry, biology, engineering, and computer science). The dataset is designed to serve as a strong, general-purpose SFT baseline for training reasoning-capable LLMs.

> Release note: due to licensing/permission constraints, some internal components (e.g., **DLR-Book**) cannot be fully open-sourced. As a result, **this repository releases ~1.58M samples**, derived from the reported 2.2M downsampled version while preserving the same curation principles and data format.

### How is it curated?

![](./imgs/long_cot_data_engine.png)

We adopt a data curation engine with the following stages:

1. **Annotation**: validity/unambiguity filtering; discipline/domain; educational level; answer type; verifiable answer (when applicable)
2. **Deduplication**: exact + near-duplicate removal
3. **Decontamination**: removal of samples overlapping with evaluation benchmarks (MinHash + n-gram)
4. **Response distillation**: teacher-model long-CoT generation; repetition suppression; optional verification & regeneration
5. **Weighted stratified sampling**: uses **response length as a proxy for difficulty**, balancing hard reasoning density and broad coverage

## Experimental Results

![](./imgs/math_evaluation_res.png)

![](./imgs/stem_evaluation_res.png)

## Quickstart

### Load with `datasets`

```python
from datasets import load_dataset

ds = load_dataset("Logics-MLLM/Logics-STEM-SFT-Dataset-2.2M")
print(ds)
print(ds["train"][0])
```

## Data Collection

We collect questions from the following publicly available datasets:

- **NuminaMath-1.5** (AI-MO/NuminaMath-1.5)
- **OpenThoughts3** (OpenThoughts)
- **Mixture-of-Thoughts / Open-R1** (Hugging Face Open-R1 related data recipe)
- **AceReason-1.1-SFT** (AceReason-Nemotron 1.1 SFT data)
- **AceReason-Math**
- **OpenScienceReasoning-2** (nvidia/OpenScienceReasoning-2)
- **OpenMathReasoning** (OpenMathReasoning; AIMO-2 winning solution dataset)
- **Llama-Nemotron-Post-Training-Dataset** (nvidia Llama-Nemotron post-training dataset)
- **DLR-Web** (DESIGNER: web-derived synthetic set)
- **DLR-Book** (DESIGNER: book-derived synthetic set)
- **Skywork-OR1-RL-Data** (Skywork Open Reasoner RL data)
- **NaturalReasoning** (NaturalReasoning)
- **DeepMath-103K** (DeepMath-103K)
- **DAPO-Math-17K** (DAPO Math 17K)
- **TheoremQA** (TheoremQA)
- **JEEBench** (JEEBench)
- **GPQA-Main** (GPQA)
- **GSM8K** (GSM8K)
- **AIME** (di-zhang-fdu/AIME_1983_2024)
- **AMC** (kaggle-aimo/amc_filtered)
- **s1-teasers** (s1 dataset)
- **s1-probs** (s1 dataset)
- **openaimath** (s1 / OpenAI-math style dataset referenced by s1)

*Acknowledgement: We sincerely thank the creators and maintainers of these open-source datasets for making this work possible.*

## Citation

If you use this dataset, please cite our technical report:

```bibtex
@misc{xu2026logicsstemempoweringllmreasoning,
  title={Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement},
  author={Mingyu Xu and Cheng Fang and Keyue Jiang and Yuqian Zheng and Yanghua Xiao and Baojian Zhou and Qifang Zhao and Suhang Zheng and Xiuwen Zhu and Jiyang Tang and Yongchi Zhao and Yijia Luo and Zhiqi Bai and Yuchi Xu and Wenbo Su and Wei Wang and Bing Zhao and Lin Qu and Xiaoxiao Xu},
  year={2026},
  eprint={2601.01562},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01562},
}
```
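
The decontamination stage of the curation pipeline is described only as "MinHash + n-gram". As a rough illustration of what such a filter can look like, here is a minimal, standard-library sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names (`ngrams`, `minhash_signature`, `estimated_jaccard`, `decontaminate`), the word-level 3-grams, the 64 hash functions, and the 0.5 threshold are all hypothetical choices.

```python
import hashlib
import re


def ngrams(text, n=3):
    """Return the set of word-level n-grams of a lowercased text, ignoring punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(grams, num_perm=64):
    """Build a MinHash signature: for each seeded hash, keep the minimum over all n-grams."""
    if not grams:
        return []
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of n-gram Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def decontaminate(samples, benchmark_questions, n=3, threshold=0.5):
    """Keep only samples whose estimated similarity to every benchmark item is below threshold."""
    bench_sigs = [minhash_signature(ngrams(q, n)) for q in benchmark_questions]
    kept = []
    for sample in samples:
        sig = minhash_signature(ngrams(sample, n))
        if all(estimated_jaccard(sig, b) < threshold for b in bench_sigs):
            kept.append(sample)
    return kept


# Hypothetical toy data: the first sample is a benchmark item with different casing/punctuation.
benchmark = ["What is the sum of the first 100 positive integers?"]
samples = [
    "what is the sum of the first 100 positive integers",
    "Compute the derivative of x squared with respect to x.",
]
kept = decontaminate(samples, benchmark)
print(kept)
```

At real scale one would normally bucket signatures with locality-sensitive hashing rather than compare every sample against every benchmark item; the all-pairs loop here is kept only for readability.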

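
The final curation stage, weighted stratified sampling with response length as a proxy for difficulty, is stated without further detail. The sketch below shows one plausible reading, under explicit assumptions that are not taken from the dataset card: records are stratified by a hypothetical `domain` field, each stratum gets the same quota (`per_stratum`), and within a stratum longer responses are more likely to be drawn via weighted sampling without replacement (the Efraimidis-Spirakis key trick). The field names, the `alpha` exponent, and the equal quota are illustrative only.

```python
import random
from collections import Counter, defaultdict


def weighted_stratified_sample(records, per_stratum, alpha=1.0, seed=0):
    """Draw up to `per_stratum` records per domain, favouring longer responses.

    Assumption: each record is a dict with hypothetical keys "domain" and "response".
    """
    rng = random.Random(seed)

    # Group records by stratum (here: domain) so every domain keeps some coverage.
    strata = defaultdict(list)
    for record in records:
        strata[record["domain"]].append(record)

    picked = []
    for domain in sorted(strata):
        scored = []
        for record in strata[domain]:
            # Response length stands in for difficulty; alpha controls how strongly it counts.
            weight = max(len(record["response"]), 1) ** alpha
            # Efraimidis-Spirakis: a larger weight pushes the random key towards 1.
            key = rng.random() ** (1.0 / weight)
            scored.append((key, record))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        picked.extend(record for _, record in scored[:per_stratum])
    return picked


# Hypothetical toy records; lengths vary so the weights differ within each domain.
records = [
    {"domain": "math", "response": "x" * 50},
    {"domain": "math", "response": "x" * 500},
    {"domain": "math", "response": "x" * 5000},
    {"domain": "physics", "response": "y" * 100},
    {"domain": "physics", "response": "y" * 1000},
]
subset = weighted_stratified_sample(records, per_stratum=2)
print(Counter(r["domain"] for r in subset))
```

Raising `alpha` concentrates the sample on long, presumably harder responses, while `alpha=0` reduces to uniform sampling within each stratum, which is one way to trade reasoning density against coverage.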
Provider:
maas
Created:
2026-01-05