OpenSeek-Synthetic-Reasoning-Data-Examples
ModelScope community. Updated 2026-01-02. Indexed 2025-03-01.
Download link:
https://modelscope.cn/datasets/BAAI/OpenSeek-Synthetic-Reasoning-Data-Examples
Resource description:
# OpenSeek-Reasoning-Data
OpenSeek [[GitHub](https://github.com/FlagAI-Open/OpenSeek)|[Blog](https://hub.baai.ac.cn/view/43443)]
Recent research has demonstrated that the reasoning ability of LLMs originates in the pre-training stage and is activated by RL training. Massive raw corpora contain complex human reasoning processes, but a generalized and effective synthesis method for extracting these processes is still lacking.
## **News**
- 🔥🔥🔥[2025/02/25] We released some reasoning data for the math, code, and general-knowledge domains, synthesized with the current pipeline.
## **Source Corpus**
| Domain | Dataset | Data Volume (B) |
|:-------:|:-------------------:|:---------------:|
| Math | Proof-pile-2 | 100 |
| | FineMath | 88 |
| | Dolmino | 1708 |
| Code | OpenCoder-Annealing | 6 |
| | StarCoder | 15 |
| | OpenCoder-LLM | 51 |
| General | FineWeb-edu | 476 |
| | CCI3-HQ | 163 |
| | Nemotron-CC | 4119 |
| | Dolma | 70 |
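The per-domain totals implied by the table can be checked with a short sketch. The volumes are copied from the table and kept in the table's own unit (B); the dictionary layout and the variable names are assumptions made for illustration, not part of the dataset.

```python
# Volumes copied from the Source Corpus table, in the table's unit (B).
SOURCE_CORPUS = {
    "Math": {"Proof-pile-2": 100, "FineMath": 88, "Dolmino": 1708},
    "Code": {"OpenCoder-Annealing": 6, "StarCoder": 15, "OpenCoder-LLM": 51},
    "General": {"FineWeb-edu": 476, "CCI3-HQ": 163, "Nemotron-CC": 4119, "Dolma": 70},
}

# Sum the volume of every dataset within each domain.
domain_totals = {
    domain: sum(datasets.values()) for domain, datasets in SOURCE_CORPUS.items()
}
grand_total = sum(domain_totals.values())

for domain, total in domain_totals.items():
    print(f"{domain}: {total}")
print(f"Total: {grand_total}")
# Math: 1896
# Code: 72
# General: 4828
# Total: 6796
```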
## **Data Formats**
- id: Unique sample identifier.
- raw: The original document before synthesis.
- instruction: Core questions extracted from the original document.
- Chain-of-thought: A chain of thought obtained by segmenting the original document and summarizing the segments.
- text: Synthetic data samples used during pre-training.
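As a minimal sketch of how the fields above might be checked on a loaded record, the snippet below tests that a dictionary carries all five documented keys. The key names follow the list above; the helper `missing_fields` and the sample record are hypothetical and invented for illustration. The on-disk file format is not stated here, so loading is left out.

```python
from typing import Any, Dict, List

# Field names as listed under Data Formats.
REQUIRED_FIELDS = ("id", "raw", "instruction", "Chain-of-thought", "text")


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the documented fields that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical record, invented for illustration; not taken from the dataset.
sample = {
    "id": "example-0001",
    "raw": "Original document text ...",
    "instruction": "What question does the document answer?",
    "Chain-of-thought": "Step 1 ... Step 2 ...",
    "text": "Synthetic pre-training sample ...",
}

print(missing_fields(sample))          # []
print(missing_fields({"id": "only"}))  # ['raw', 'instruction', 'Chain-of-thought', 'text']
```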
## Reasoning Data Synthesis Pipeline V1.0

Provider:
maas
Created:
2025-02-26



