SYNTHETIC-1-SFT-Data
Source: ModelScope community · Updated 2026-04-28 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/PrimeIntellect/SYNTHETIC-1-SFT-Data
Description:
# SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1

SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the SFT version of the dataset; the raw data and preference dataset can be found in our [🤗 SYNTHETIC-1 Collection](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37).
The dataset consists of the following tasks and verifiers that were implemented in our library [genesys](https://github.com/PrimeIntellect-ai/genesys):
### **Mathematics Problems (777k samples):**
- **Tasks:** Competition-level math problems from [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), with LLM-based post-processing to turn multiple-choice questions into free-form questions and to filter out questions without automatically verifiable responses (e.g. questions asking for proofs).
- **Verifier:** Symbolic verification based on the [math-verify](https://github.com/huggingface/Math-Verify) library.
- **Task Dataset:** [PrimeIntellect/verifiable-math-problems](http://huggingface.co/datasets/PrimeIntellect/verifiable-math-problems)
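To illustrate why answers are verified by value rather than by string, here is a minimal sketch that compares plain numeric answers exactly. It is a simplified stand-in built on Python's standard `fractions` module, not the math-verify library or the genesys verifier; the function names `to_number` and `answers_match` are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def to_number(answer: str) -> Optional[Fraction]:
    """Parse a plain numeric answer such as '0.5' or '1/2' into an exact Fraction."""
    text = answer.strip().replace(" ", "")
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        # Not a plain number (e.g. an algebraic expression) or a zero denominator.
        return None


def answers_match(gold: str, predicted: str) -> bool:
    """Return True when both answers denote the same rational number.

    Falls back to a whitespace-insensitive string comparison when either
    side is not a plain number.
    """
    gold_value = to_number(gold)
    predicted_value = to_number(predicted)
    if gold_value is not None and predicted_value is not None:
        return gold_value == predicted_value
    return "".join(gold.split()) == "".join(predicted.split())
```

This sketch accepts `0.5` for a reference answer of `1/2`, but it cannot tell that `x+1` and `1+x` are equal; recognizing such equivalences is the kind of case a symbolic verifier is meant to cover.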
### **Algorithmic Coding Problems (144k samples):**
- **Tasks:** Algorithmic challenges from coding competitions and platforms such as LeetCode, curated from the [Apps](https://huggingface.co/datasets/codeparrot/apps), [Codecontests](https://huggingface.co/datasets/deepmind/code_contests), [Codeforces](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions) and [TACO](https://huggingface.co/datasets/BAAI/TACO) datasets. LLM-based post-processing was applied to additionally translate Python problems into JavaScript, Rust and C++ problems.
- **Verifier:** Containerized execution of unit tests.
- **Task Dataset:** [PrimeIntellect/verifiable-coding-problems](https://huggingface.co/datasets/PrimeIntellect/verifiable-coding-problems)
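The idea of verifying a solution by executing unit tests can be sketched as follows. This is an assumption-laden illustration, not the genesys implementation: it runs the candidate in a plain subprocess with a timeout instead of a container, it assumes tests are (stdin, expected stdout) pairs, and the name `passes_tests` is hypothetical.

```python
import subprocess
import sys


def passes_tests(solution_code, test_cases, timeout_s=5.0):
    """Run solution_code once per (stdin, expected_stdout) pair.

    Returns True only if every run exits cleanly within the timeout and
    prints the expected output (ignoring leading/trailing whitespace).
    """
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failed test
        if result.returncode != 0:
            return False  # crash or uncaught exception
        if result.stdout.strip() != expected.strip():
            return False  # wrong answer
    return True
```

A subprocess gives no real isolation from untrusted code, which is one reason to run generated solutions inside a container as the verifier described above does.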
### **Real-World Software Engineering Problems (70k samples):**
- **Tasks:** Derived from real-world GitHub commits in the [CommitPack](https://huggingface.co/datasets/bigcode/commitpackft) dataset. Each problem pairs a pre-commit code file with an LLM-generated modification instruction, crafted using context from the original commit message and the post-commit file state.
- **Verifier:** An LLM judge compares LLM-generated code against the actual post-commit file state.
- **Task Dataset:** [PrimeIntellect/real-world-swe-problems](https://huggingface.co/datasets/PrimeIntellect/real-world-swe-problems)
### **Open-Ended STEM Question Answering (313k samples):**
- **Tasks:** Questions curated from a broad range of technical and scientific topics using the [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences). LLM-based filtering retains only questions that have objectively correct responses, excluding opinion-based queries, and that require genuine reasoning rather than simple recall or memorization of information.
- **Verifier:** An LLM judge scores responses by comparing them to the most upvoted answer.
- **Task Dataset:** [PrimeIntellect/stackexchange-question-answering](https://huggingface.co/datasets/PrimeIntellect/stackexchange-question-answering)
### **Synthetic Code Understanding Tasks (61k samples):**
- **Tasks:** Fully synthetic task where the goal is to predict the output of code that performs string transformations given the code and some string input. We generate arbitrary string-processing functions via LLM prompting and recursively increase their complexity using a scheme akin to [evol-instruct](https://arxiv.org/pdf/2304.12244). Inputs include both random strings and snippets from news articles, with ground truth outputs obtained by executing the generated code.
- **Verifier:** LLM-predicted output strings are directly compared with real output strings and are judged as correct when an exact match occurs.
- **Task Dataset:** [PrimeIntellect/synthetic-code-understanding](https://huggingface.co/datasets/PrimeIntellect/synthetic-code-understanding)
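The exact-match check described above can be sketched in a few lines. The `transform` function below is a hypothetical stand-in for an LLM-generated string-processing function, and `is_correct` is an illustrative name; neither is taken from the dataset or from genesys.

```python
def transform(text: str) -> str:
    """Hypothetical string-processing function: reverse the text and swap letter case."""
    return text[::-1].swapcase()


def is_correct(predicted_output: str, code_input: str) -> bool:
    """Judge a prediction against the ground truth obtained by running the code."""
    ground_truth = transform(code_input)
    # Exact match: no trimming or normalization is applied.
    return predicted_output == ground_truth
```

For the input `"Hello"`, running `transform` yields `"OLLEh"`, so only that exact string counts as a correct prediction; a trailing space or a case difference is judged incorrect.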
## Citation
Feel free to cite SYNTHETIC-1 if you have found it useful for your work:
```bib
@misc{2025synthetic1,
title={SYNTHETIC-1: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1},
author={Justus Mattern and Sami Jaghouar and Manveer Basra and Jannik Straube and Matthew Di Ferrante and Felix Gabriel and Jack Min Ong and Vincent Weisser and Johannes Hagemann},
year={2025},
url={https://www.primeintellect.ai/blog/synthetic-1-release},
}
```
Provider:
maas
Created:
2025-05-13



