OpenSeek-Synthetic-Reasoning-Data-Examples

Name: OpenSeek-Synthetic-Reasoning-Data-Examples
Creator: maas
Published: 2026-01-02 16:24:26
License: No description provided

ModelScope Community. Updated 2026-01-02; indexed 2025-03-01.

Download link:

https://modelscope.cn/datasets/BAAI/OpenSeek-Synthetic-Reasoning-Data-Examples


Resource overview:

# OpenSeek-Reasoning-Data

OpenSeek [[Github](https://github.com/FlagAI-Open/OpenSeek)|[Blog](https://hub.baai.ac.cn/view/43443)]

Recent research has demonstrated that the reasoning ability of LLMs originates in the pre-training stage and is activated by RL training. Massive raw corpora contain complex human reasoning processes, but there is no generalized and effective synthesis method for extracting these processes.

## **News**

- 🔥🔥🔥[2025/02/25] We have published some reasoning data for the math, code, and general knowledge domains, synthesized with the current pipeline.

## **Source Corpus**

| Domain | Dataset | Data Volume (B) |
|:-------:|:-------------------:|:---------------:|
| Math | Proof-pile-2 | 100 |
| | FineMath | 88 |
| | Dolmino | 1708 |
| Code | OpenCoder-Annealing | 6 |
| | StarCoder | 15 |
| | OpenCoder-LLM | 51 |
| General | FineWeb-edu | 476 |
| | CCI3-HQ | 163 |
| | Nemotron-CC | 4119 |
| | Dolma | 70 |

## **Data Formats**

- id: Unique sample identifier.
- raw: The original document before synthesis.
- instruction: Core questions extracted from the original document.
- Chain-of-thought: A chain of thought that summarizes the original document, produced after segmenting and summarizing it.
- text: Synthetic data samples used during pre-training.

## Reasoning Data Synthesis Pipeline V1.0

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6397e7c22fe4fee54933f6c2/tDJCsPRgyFe2QhTF35Ca7.png)
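The five fields listed under Data Formats can be modeled as a small record type. The sketch below is illustrative only: it assumes each sample is stored as one JSON object per line and that every field is a string, neither of which is stated in the dataset card. The names `ReasoningSample` and `parse_sample` are hypothetical helpers, not part of OpenSeek.

```python
import json
from dataclasses import dataclass

# Field names are taken from the Data Formats list. Treating every value
# as a string and storing one JSON object per line are assumptions.
FIELDS = ("id", "raw", "instruction", "Chain-of-thought", "text")


@dataclass
class ReasoningSample:
    id: str
    raw: str
    instruction: str
    chain_of_thought: str  # stored under the key "Chain-of-thought"
    text: str


def parse_sample(line: str) -> ReasoningSample:
    """Parse one JSON line into a ReasoningSample, checking all fields exist."""
    record = json.loads(line)
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ReasoningSample(
        id=str(record["id"]),
        raw=record["raw"],
        instruction=record["instruction"],
        chain_of_thought=record["Chain-of-thought"],
        text=record["text"],
    )
```

Because the key `Chain-of-thought` contains hyphens, it cannot be a Python attribute name, so the sketch maps it to `chain_of_thought` while reading the original key from the record.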


Provider:

maas

Created:

2025-02-26
