OpenThoughts-114k
Source: ModelScope community. Updated 2026-05-17; indexed 2025-02-01.
Download link: https://modelscope.cn/datasets/open-thoughts/OpenThoughts-114k
<p align="center">
<img src="open_thoughts.png" width="50%">
</p>
> [!NOTE]
> We have released a paper for OpenThoughts! See our paper [here](https://arxiv.org/abs/2506.04178).
<a href="https://github.com/bespokelabsai/curator/">
<img src="https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k/resolve/main/made_with_curator.png" alt="Made with Curator" width=200px>
</a>
# Open-Thoughts-114k
## Dataset Description
- **Homepage:** https://www.open-thoughts.ai/
- **Repository:** https://github.com/open-thoughts/open-thoughts
- **Point of Contact:** [Open Thoughts Team](mailto:contact@open-thoughts.ai)
Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles!
Inspect the content with rich formatting in the [Curator Viewer](https://curator.bespokelabs.ai/datasets/1389c194254c4ead96daaf145505c3d1).
### Available Subsets
**default** subset containing ready-to-train data used to finetune the [OpenThinker-7B](https://huggingface.co/open-thoughts/OpenThinker-7B) and [OpenThinker-32B](https://huggingface.co/open-thoughts/OpenThinker-32B) models:
```python
from datasets import load_dataset
ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
```
**metadata** subset containing extra columns used in dataset construction:
- `problem`
- `ground_truth_solution`
- `deepseek_reasoning`
- `deepseek_solution`
- `domain`
- `source`
- `test_cases` (code only)
- `starter_code` (code only)
```python
from datasets import load_dataset
ds = load_dataset("open-thoughts/OpenThoughts-114k", "metadata", split="train")
```
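Once the `metadata` subset is loaded, its extra columns make it straightforward to slice the data by domain. The sketch below works on plain dictionaries shaped like metadata rows, so it runs without downloading anything. The sample rows, the helper names, and the domain labels (`"math"`, `"code"`) are illustrative assumptions, not values taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical rows that mimic some of the metadata columns listed above.
# All field values are made up for illustration.
sample_rows = [
    {"problem": "Compute 2 + 2.", "domain": "math", "source": "example-a",
     "test_cases": None, "starter_code": None},
    {"problem": "Reverse a string.", "domain": "code", "source": "example-b",
     "test_cases": '{"inputs": ["abc"], "outputs": ["cba"]}', "starter_code": ""},
    {"problem": "Sum a list.", "domain": "code", "source": "example-b",
     "test_cases": None, "starter_code": None},
]


def count_by_domain(rows: Iterable[Dict]) -> Counter:
    """Count how many rows fall into each value of the `domain` column."""
    return Counter(row["domain"] for row in rows)


def rows_with_tests(rows: Iterable[Dict]) -> List[Dict]:
    """Keep only rows whose `test_cases` column is present and non-empty."""
    return [row for row in rows if row.get("test_cases")]


domain_counts = count_by_domain(sample_rows)
testable = rows_with_tests(sample_rows)
print(domain_counts)
print(len(testable))
```

The same helpers should accept the `ds` object returned by `load_dataset`, since iterating over a dataset split yields one dictionary per row.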
# OpenThinker Models
The numbers reported in the table below were obtained with our open-source evaluation tool [Evalchemy](https://github.com/mlfoundations/Evalchemy).
| | AIME24 | MATH500 | GPQA-Diamond | LCBv2 Easy | LCBv2 Medium | LCBv2 Hard | LCBv2 All |
| --------------------------- | -------- | ------- | ------------ | ----------- | ------------- | ----------- | ---------- |
| [OpenThinker-32B](https://huggingface.co/open-thoughts/OpenThinker-32B) | 66 | 90.6 | 61.6 | 95.1 | 70.9 | 26.8 | 68.9 |
| [OpenThinker-7B](https://huggingface.co/open-thoughts/OpenThinker-7B) | 31.3 | 83.0 | 42.4 | 75.3 | 28.6 | 6.5 | 39.9 |
| Bespoke-Stratos-7B | 22.7 | 79.6 | 38.9 | 71.4 | 25.2 | 0.8 | 35.8 |
| DeepSeek-R1-Distill-Qwen-7B | 60 | 88.2 | 46.9 | 79.7 | 45.1 | 14.6 | 50.1 |
| gpt-4o-0513 | 8.7 | 75.8 | 46.5 | 87.4 | 42.7 | 8.9 | 50.5 |
| o1-mini | 64 | 85.6 | 60 | 92.8 | 74.7 | 39.8 | 72.8 |
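To read the table as differences rather than absolute scores, the short sketch below copies four columns for two of the 7B models and computes the per-benchmark gain of OpenThinker-7B over Bespoke-Stratos-7B. The dictionary layout and the `gains` helper are illustrative; the numbers themselves are taken from the table above.

```python
# Scores copied from the table above (four of the seven columns).
scores = {
    "OpenThinker-7B": {
        "AIME24": 31.3, "MATH500": 83.0, "GPQA-Diamond": 42.4, "LCBv2 All": 39.9,
    },
    "Bespoke-Stratos-7B": {
        "AIME24": 22.7, "MATH500": 79.6, "GPQA-Diamond": 38.9, "LCBv2 All": 35.8,
    },
}


def gains(model: str, baseline: str) -> dict:
    """Return model-minus-baseline differences, rounded to one decimal place."""
    return {
        bench: round(scores[model][bench] - scores[baseline][bench], 1)
        for bench in scores[model]
    }


deltas = gains("OpenThinker-7B", "Bespoke-Stratos-7B")
for bench, delta in deltas.items():
    print(f"{bench}: {delta:+.1f}")
```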
We are fully open-source. Our [model weights](https://huggingface.co/open-thoughts), [datasets](https://huggingface.co/open-thoughts), [data generation code](https://github.com/open-thoughts/open-thoughts), [evaluation code](https://github.com/mlfoundations/Evalchemy), and [training code](https://github.com/hiyouga/LLaMA-Factory) are all publicly available.
| | Open Weights | Open Data | Open Code |
|--|--------------|-----------| --------- |
|OpenThinker-32B|✅|[✅](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)|[✅](https://github.com/open-thoughts/open-thoughts) |
|OpenThinker-7B|✅|[✅](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)|[✅](https://github.com/open-thoughts/open-thoughts) |
|Bespoke-Stratos-7B|✅|[✅](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)|[✅](https://github.com/bespokelabsai/curator/tree/main/examples/bespoke-stratos-data-generation)|
|DeepSeek-R1-Distill models|✅|❌|❌|
|OpenAI/Gemini|❌|❌|❌|
We are actively working towards improving the dataset, so please stay tuned!
# Data Curation Recipe
Code
- [BAAI/TACO](https://huggingface.co/datasets/BAAI/TACO)
- [codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)
- [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests)
- [MatrixStudio/Codeforces-Python-Submissions](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions)
Math
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
Science
- [camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry)
- [camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology)
- [camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics)
Puzzle
- [INK-USC/riddle_sense](https://huggingface.co/datasets/INK-USC/riddle_sense)
Using a curated mix of the datasets above, we generate reasoning traces from DeepSeek-R1 and verify correctness to construct the final dataset.

The full code for the data generation pipeline is publicly available [in our GitHub repo](https://github.com/open-thoughts/open-thoughts).
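The generate-then-verify idea described above can be illustrated with a toy filter. This is only a sketch of the general pattern, not the project's pipeline (the linked repository is the reference for that): `generate_answer` stands in for a call to DeepSeek-R1, and `answers_match` is a hypothetical string-normalizing comparison used in place of real verification.

```python
from typing import Callable, Dict, Iterable, List


def answers_match(candidate: str, reference: str) -> bool:
    """Hypothetical check: compare answers after lowercasing and collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return normalize(candidate) == normalize(reference)


def build_verified_set(
    problems: Iterable[Dict[str, str]],
    generate_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only the problems whose generated answer matches the reference answer."""
    verified = []
    for item in problems:
        candidate = generate_answer(item["problem"])
        if answers_match(candidate, item["ground_truth_solution"]):
            verified.append({"problem": item["problem"], "solution": candidate})
    return verified


# Stand-in for a model call: canned answers, one right and one wrong.
canned_answers = {"What is 2 + 2?": " 4 ", "What is 3 * 3?": "6"}


def fake_model(problem: str) -> str:
    return canned_answers[problem]


problems = [
    {"problem": "What is 2 + 2?", "ground_truth_solution": "4"},
    {"problem": "What is 3 * 3?", "ground_truth_solution": "9"},
]

verified = build_verified_set(problems, fake_model)
print(verified)
```

Passing the generator in as a function keeps the filtering logic independent of any particular model API, so the same loop could wrap a real model client or a cached set of responses.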
# Links
- 📝 [OpenThoughts Paper](https://arxiv.org/abs/2506.04178)
- 📊 [OpenThinker-32B Blog Post](https://www.open-thoughts.ai/blog/scale)
- 📊 [Measuring Reasoning with Evalchemy Blog Post](https://www.open-thoughts.ai/blog/measure)
- 📊 [Open Thoughts Launch Blog Post](https://www.open-thoughts.ai/blog/launch)
- 💻 [Open Thoughts GitHub Repository](https://github.com/open-thoughts/open-thoughts)
- 🧠 [OpenThoughts-114k dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) - this dataset.
- 🤖 [OpenThinker-32B model](https://huggingface.co/open-thoughts/OpenThinker-32B)
- 🤖 [OpenThinker-7B model](https://huggingface.co/open-thoughts/OpenThinker-7B)
- 📊 [Bespoke-Stratos Blog Post](https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation)
- 🧠 [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)
- 🤖 [Bespoke-Stratos-32B model](https://huggingface.co/bespokelabs/Bespoke-Stratos-32B)
- 🤖 [Bespoke-Stratos-7B model](https://huggingface.co/bespokelabs/Bespoke-Stratos-7B)
- 💻 [Curator Viewer](https://curator.bespokelabs.ai/datasets/1389c194254c4ead96daaf145505c3d1)
## Visualization
Inspect the content with rich formatting in the [Curator Viewer](https://curator.bespokelabs.ai/datasets/1389c194254c4ead96daaf145505c3d1).
All 114k examples, clustered by semantic similarity, can be explored in [Nomic Atlas](https://atlas.nomic.ai/data/nomic/openthoughts-114k/map).
<a href="https://atlas.nomic.ai/data/nomic/openthoughts-114k/map">
<img src="https://cdn-uploads.huggingface.co/production/uploads/630bfb6b86b8b9904c35f4d1/d7TjezV6R3OnIDlEVL1Rl.png" alt="Nomic Atlas Open-Thoughts-114k Map" width="35%"/>
</a>
# Citation
```
@misc{guha2025openthoughtsdatarecipesreasoning,
title={OpenThoughts: Data Recipes for Reasoning Models},
author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
year={2025},
eprint={2506.04178},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.04178},
}
```
- **Provider:** maas
- **Created:** 2025-01-29
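# Dataset Overview
OpenThoughts-114k is an open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles. It provides two subsets (`default` and `metadata`), supports inspection of its content, and all related resources (model weights, data, and code) are publicly available.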



