SYNTHETIC-1
ModelScope community · Updated 2026-05-12 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/PrimeIntellect/SYNTHETIC-1
Description:
# SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1

SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the raw version of the dataset, without any filtering for correctness. Filtered datasets intended for fine-tuning, as well as our 7B model, can be found in our [🤗 SYNTHETIC-1 Collection](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37).
The dataset consists of the following tasks and verifiers that were implemented in our library [genesys](https://github.com/PrimeIntellect-ai/genesys):
### **Mathematics Problems (777k samples):**
- **Tasks:** Competition-level math problems from [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), with LLM-based post-processing to turn multiple-choice questions into free-form questions and to filter out questions without automatically verifiable responses (e.g. questions asking for proofs).
- **Verifier:** Symbolic verification based on the [math-verify](https://github.com/huggingface/Math-Verify) library.
- **Task Dataset:** [PrimeIntellect/verifiable-math-problems](http://huggingface.co/datasets/PrimeIntellect/verifiable-math-problems)
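
To illustrate why answers are checked for equivalence rather than by string comparison, here is a minimal sketch that treats two answers as equal when they denote the same rational number. It uses only Python's standard `fractions` module as a simplified stand-in. It is not the math-verify API, and the helper name `answers_match` is hypothetical.

```python
from fractions import Fraction


def answers_match(gold: str, predicted: str) -> bool:
    """Return True when both strings denote the same rational number.

    Hypothetical helper: a simplified stand-in for symbolic verification
    that only understands integers, decimals, and fractions such as "3/4".
    """
    gold, predicted = gold.strip(), predicted.strip()
    try:
        return Fraction(gold) == Fraction(predicted)
    except (ValueError, ZeroDivisionError):
        # Anything that is not a plain number falls back to string equality,
        # which is exactly the weakness a symbolic verifier avoids.
        return gold == predicted
```

With this sketch, `"1/2"` and `"0.5"` count as the same answer, while `"x+1"` and `"1+x"` do not, which is the kind of case that calls for a real symbolic verifier.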
### **Algorithmic Coding Problems (144k samples):**
- **Tasks:** Algorithmic challenges from coding competitions and platforms such as LeetCode, curated from the [Apps](https://huggingface.co/datasets/codeparrot/apps), [Codecontests](https://huggingface.co/datasets/deepmind/code_contests), [Codeforces](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions) and [TACO](https://huggingface.co/datasets/BAAI/TACO) datasets. LLM-based post-processing was additionally applied to translate Python problems into JavaScript, Rust and C++ problems.
- **Verifier:** Containerized execution of unit tests.
- **Task Dataset:** [PrimeIntellect/verifiable-coding-problems](https://huggingface.co/datasets/PrimeIntellect/verifiable-coding-problems)
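
The core of a unit-test verifier can be sketched as: run the candidate program on a test input and compare its output with the expected output. The sketch below is an assumption-laden simplification. It uses a plain subprocess with a timeout instead of a container, handles Python solutions only, and the function name `passes_test` is hypothetical rather than taken from genesys.

```python
import subprocess
import sys


def passes_test(solution_code: str, stdin_text: str, expected_stdout: str,
                timeout_s: float = 5.0) -> bool:
    """Run a Python solution in a fresh interpreter and compare its stdout.

    Hypothetical helper: a subprocess gives no real isolation, so a
    production setup would run this step inside a container instead.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", solution_code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Solutions that exceed the time limit are treated as failures.
        return False
    if result.returncode != 0:
        # A crash or non-zero exit code also counts as a failure.
        return False
    return result.stdout.strip() == expected_stdout.strip()


# Example: a solution that reads two integers and prints their sum.
solution = "a, b = map(int, input().split())\nprint(a + b)"
is_correct = passes_test(solution, "2 3\n", "5")
```

Comparing stripped output is a deliberate leniency about trailing whitespace; a stricter verifier could compare the raw strings.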
### **Real-World Software Engineering Problems (70k samples):**
- **Tasks:** Derived from real-world GitHub commits in the [CommitPack](https://huggingface.co/datasets/bigcode/commitpackft) dataset. Each problem pairs a pre-commit code file with an LLM-generated modification instruction, crafted using context from the original commit message and the post-commit file state.
- **Verifier:** An LLM judge compares LLM-generated code against the actual post-commit file state.
- **Task Dataset:** [PrimeIntellect/real-world-swe-problems](https://huggingface.co/datasets/PrimeIntellect/real-world-swe-problems)
### **Open-Ended STEM Question Answering (313k samples):**
- **Tasks:** Questions curated from a broad range of technical and scientific topics using the [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences). LLM-based filtering retains only questions with objectively correct responses, excluding opinion-based queries, and keeps only questions that require genuine reasoning rather than simple recall or memorization of information.
- **Verifier:** An LLM judge scores responses by comparing them to the most upvoted answer.
- **Task Dataset:** [PrimeIntellect/stackexchange-question-answering](https://huggingface.co/datasets/PrimeIntellect/stackexchange-question-answering)
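
Before the judge can score a response, a reference answer has to be chosen. A minimal sketch of that selection step follows, assuming each answer carries a text and an upvote count. The `Answer` class, its field names, and `reference_answer` are hypothetical and do not reflect the dataset's actual schema. The LLM judge itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    # Hypothetical fields; the real dataset may name these differently.
    text: str
    upvotes: int


def reference_answer(answers: list[Answer]) -> Answer:
    """Pick the most upvoted answer; ties go to the earliest one listed."""
    if not answers:
        raise ValueError("question has no answers to use as a reference")
    # max() returns the first maximal element, which gives the tie-break.
    return max(answers, key=lambda a: a.upvotes)
```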
### **Synthetic Code Understanding Tasks (61k samples):**
- **Tasks:** A fully synthetic task: given code that performs string transformations and a string input, predict the code's output. We generate arbitrary string-processing functions via LLM prompting and recursively increase their complexity using a scheme akin to [evol-instruct](https://arxiv.org/pdf/2304.12244). Inputs include both random strings and snippets from news articles, with ground truth outputs obtained by executing the generated code.
- **Verifier:** LLM-predicted output strings are directly compared with real output strings and are judged as correct when an exact match occurs.
- **Task Dataset:** [PrimeIntellect/synthetic-code-understanding](https://huggingface.co/datasets/PrimeIntellect/synthetic-code-understanding)
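
The task and its verifier can be sketched in a few lines. The `transform` function below is a hand-written stand-in for an LLM-generated string-processing function, and both it and `exact_match` are hypothetical names chosen for illustration.

```python
def transform(s: str) -> str:
    """Hypothetical generated function: reverse, then upper-case vowels."""
    reversed_s = s[::-1]
    return "".join(c.upper() if c in "aeiou" else c for c in reversed_s)


def exact_match(predicted: str, ground_truth: str) -> bool:
    """Verifier: a prediction is correct only if the strings are identical."""
    return predicted == ground_truth


# The ground truth comes from executing the code, not from a model.
ground_truth = transform("hello world")

# A model's predicted output would be checked like this.
is_correct = exact_match("dlrOw OllEh", ground_truth)
```

Because the ground truth is produced by running the code, no human labeling is needed, and the exact-match check leaves no room for partial credit.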
## Citation
Feel free to cite SYNTHETIC-1 if you have found it useful for your work:
```bib
@misc{2025synthetic1,
title={SYNTHETIC-1: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1},
author={Justus Mattern and Sami Jaghouar and Manveer Basra and Jannik Straube and Matthew Di Ferrante and Felix Gabriel and Jack Min Ong and Vincent Weisser and Johannes Hagemann},
year={2025},
url={https://www.primeintellect.ai/blog/synthetic-1-release},
}
```
Provider:
maas
Created:
2025-05-13



