
OpenThoughts2-1M

Source: ModelScope community (updated 2026-01-08, indexed 2025-04-05)
Download link:
https://modelscope.cn/datasets/AI-ModelScope/OpenThoughts2-1M
Description:
<p align="center">
    <img src="https://huggingface.co/datasets/open-thoughts/open-thoughts-114k/resolve/main/open_thoughts.png" width="50%">
</p>

> [!NOTE]
> We have released a paper for OpenThoughts! See our paper [here](https://arxiv.org/abs/2506.04178).

<a href="https://github.com/bespokelabsai/curator/">
 <img src="https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k/resolve/main/made_with_curator.png" alt="Made with Curator" width=200px>
</a>

# OpenThoughts2-1M

## Dataset Description

- **Homepage:** https://www.open-thoughts.ai/
- **Repository:** https://github.com/open-thoughts/open-thoughts
- **Point of Contact:** [Open Thoughts Team](contact@open-thoughts.ai)

Open synthetic reasoning dataset with 1M high-quality examples covering math, science, code, and puzzles!

[OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M) builds upon our previous [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, augmenting it with existing datasets like [OpenR1](https://huggingface.co/open-r1), as well as additional math and code reasoning data.
This dataset was used to train [OpenThinker2-7B](https://huggingface.co/open-thoughts/OpenThinker2-7B) and [OpenThinker2-32B](https://huggingface.co/open-thoughts/OpenThinker2-32B).

Inspect the content with rich formatting and search & filter capabilities in [Curator Viewer](https://curator.bespokelabs.ai/datasets/5bc1320f0afd45069cfada91a3b59c79?appId=022826a99b5c40619738d9ef48e06bc5).

See our [blog post](https://www.open-thoughts.ai/blog/thinkagain) for more details.

# OpenThinker2 Models

Our OpenThinker2 models trained on this dataset are top performing models, comparable with DeepSeek-R1-Distill models.
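To make "comparable" concrete, the sketch below subtracts the DeepSeek-R1-Distill scores from the OpenThinker2 scores, using the numbers reported in the benchmark tables of this section. The unweighted mean across the six benchmarks is an assumption made here for illustration; it is not a metric the authors report, and the helper names are hypothetical.

```python
# Scores copied from the benchmark tables in this section.
BENCHMARKS = ["AIME24", "AIME25", "AMC23", "MATH500", "GPQA-D", "LCBv2"]

SCORES = {
    "OpenThinker2-32B": [76.7, 58.7, 94.0, 90.8, 64.1, 72.5],
    "DeepSeek-R1-Distill-Qwen-32B": [74.7, 50.0, 96.5, 90.0, 65.8, 72.3],
    "OpenThinker2-7B": [50.0, 33.3, 89.5, 88.4, 49.3, 55.6],
    "DeepSeek-R1-Distill-Qwen-7B": [57.3, 33.3, 92.0, 89.6, 47.3, 48.4],
}


def score_deltas(model: str, baseline: str) -> dict[str, float]:
    """Per-benchmark difference (model minus baseline), rounded to one decimal."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


def mean_score(model: str) -> float:
    """Unweighted mean over the six benchmarks (an illustrative assumption)."""
    return sum(SCORES[model]) / len(SCORES[model])


if __name__ == "__main__":
    pairs = [
        ("OpenThinker2-32B", "DeepSeek-R1-Distill-Qwen-32B"),
        ("OpenThinker2-7B", "DeepSeek-R1-Distill-Qwen-7B"),
    ]
    for model, baseline in pairs:
        print(model, "vs", baseline)
        print("  deltas:", score_deltas(model, baseline))
        print("  means: %.1f vs %.1f" % (mean_score(model), mean_score(baseline)))
```

On these numbers the differences are mixed in sign for both model sizes: each OpenThinker2 model leads on some benchmarks and trails on others.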
[OpenThinker2-32B](https://huggingface.co/open-thoughts/OpenThinker2-32B)

| Model | Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
| ----- | ---- | ------ | ------ | ----- | ------- | ------ | ----- |
| [OpenThinker2-32B](https://huggingface.co/open-thoughts/OpenThinker2-32B) | ✅ | 76.7 | 58.7 | 94.0 | 90.8 | 64.1 | 72.5 |
| [OpenThinker-32B](https://huggingface.co/open-thoughts/OpenThinker-32B) | ✅ | 68.0 | 49.3 | 95.5 | 90.6 | 63.5 | 68.6 |
| [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | ❌ | 74.7 | 50.0 | 96.5 | 90.0 | 65.8 | 72.3 |
| [Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B) | ✅ | 74.7 | 58.0 | 96.0 | 90.4 | 62.0 | 56.0 |
| [S1.1-32B](https://huggingface.co/simplescaling/s1.1-32B) | ✅ | 59.3 | 42.7 | 91.5 | 87.4 | 62.0 | 58.7 |

[OpenThinker2-7B](https://huggingface.co/open-thoughts/OpenThinker2-7B)

| Model | Data | AIME24 | AIME25 | AMC23 | MATH500 | GPQA-D | LCBv2 |
| ----- | ---- | ------ | ------ | ----- | ------- | ------ | ----- |
| [OpenThinker2-7B](https://huggingface.co/open-thoughts/OpenThinker2-7B) | ✅ | 50.0 | 33.3 | 89.5 | 88.4 | 49.3 | 55.6 |
| [OpenThinker-7B](https://huggingface.co/open-thoughts/OpenThinker-7B) | ✅ | 31.3 | 23.3 | 74.5 | 83.2 | 42.9 | 38.0 |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | ❌ | 57.3 | 33.3 | 92.0 | 89.6 | 47.3 | 48.4 |
| [OlympicCoder-7B](https://huggingface.co/open-r1/OlympicCoder-7B) | ✅ | 20.7 | 15.3 | 63.0 | 74.8 | 25.3 | 55.4 |
| [OpenR1-Qwen-7B](https://huggingface.co/open-r1/OpenR1-Qwen-7B) | ✅ | 48.7 | 34.7 | 88.5 | 87.8 | 21.2 | 9.5 |

# Data Curation Recipe

![openthoughts2-diagram](openthoughts2-diagram.png)

We used two methods to create OpenThoughts2-1M by adding to OpenThoughts-114K:
1. **Leveraging existing reasoning data generated by other members of the open source community** -- We fine-tuned Qwen-2.5-7B-Instruct models on GeneralThought, OpenR1-Math, Nemotron, Synthetic-1, KodCode and measured downstream performance on our reasoning evaluation suite. Out of the datasets that we used in these experiments, we found that OpenR1-Math performed the best overall.
2. **Sourcing and generating new code and math reasoning data** -- We sourced 11 different methodologies of generating math questions and 15 different methods for generating code questions. To determine the best data sources, we measure the downstream performance of each model on relevant reasoning benchmarks. Using 30K questions from each of the top 5 data sources for code and 12.5k questions from each of the top 4 data sources for math on top of our OpenThoughts-114K + OpenR1 mix, we generate additional math and code instructions.

The final [OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M) is a combination of OpenThoughts-114k, OpenR1, and our newly generated math and code reasoning data.

# Links

- 📝 [OpenThoughts Paper](https://arxiv.org/abs/2506.04178)
- 📊 [OpenThoughts2 and OpenThinker2 Blog Post](https://www.open-thoughts.ai/blog/thinkagain)
- 💻 [Open Thoughts GitHub Repository](https://github.com/open-thoughts/open-thoughts)
- 🧠 [OpenThoughts2-1M dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M) - this dataset.
- 🤖 [OpenThinker2-7B model](https://huggingface.co/open-thoughts/OpenThinker2-7B)
- 🤖 [OpenThinker2-32B model](https://huggingface.co/open-thoughts/OpenThinker2-32B)
- 💻 [Curator Viewer](https://curator.bespokelabs.ai/datasets/5bc1320f0afd45069cfada91a3b59c79?appId=022826a99b5c40619738d9ef48e06bc5)

# Visualization

Inspect the content with rich formatting and search & filter capabilities in [Curator Viewer](https://curator.bespokelabs.ai/datasets/5bc1320f0afd45069cfada91a3b59c79?appId=022826a99b5c40619738d9ef48e06bc5).
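Returning to step 2 of the data curation recipe, the question counts quoted there can be tallied with a few lines of arithmetic. This is a minimal sketch: the constant and function names are hypothetical, and the result only counts the seed questions named in that sentence. The card does not say how these questions map onto the rows of the final 1M-example dataset, so no such breakdown is derived here.

```python
# Figures quoted in step 2 of the Data Curation Recipe.
CODE_SOURCES = 5                    # top 5 data sources for code
QUESTIONS_PER_CODE_SOURCE = 30_000  # "30K questions from each"
MATH_SOURCES = 4                    # top 4 data sources for math
QUESTIONS_PER_MATH_SOURCE = 12_500  # "12.5k questions from each"


def added_questions() -> dict[str, int]:
    """Count the newly sourced seed questions per domain and in total."""
    code = CODE_SOURCES * QUESTIONS_PER_CODE_SOURCE
    math = MATH_SOURCES * QUESTIONS_PER_MATH_SOURCE
    return {"code": code, "math": math, "total": code + math}


if __name__ == "__main__":
    for domain, count in added_questions().items():
        print(f"{domain}: {count:,} questions")
```

Under these figures the recipe names 150,000 code questions and 50,000 math questions, 200,000 in total.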
# Citation

```
@misc{guha2025openthoughtsdatarecipesreasoning,
  title={OpenThoughts: Data Recipes for Reasoning Models},
  author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
  year={2025},
  eprint={2506.04178},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.04178},
}
```

Provider:
maas
Created:
2025-04-05