SYNTHETIC-1-Preference-Data
ModelScope community · Updated 2025-12-05 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/PrimeIntellect/SYNTHETIC-1-Preference-Data
Description:
# SYNTHETIC-1: Two Million Crowdsourced Reasoning Traces from Deepseek-R1

SYNTHETIC-1 is a reasoning dataset obtained from Deepseek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges or symbolic mathematics verifiers. This is the SFT version of the dataset; the raw data and SFT dataset can be found in our [🤗 SYNTHETIC-1 Collection](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37).
The dataset consists of the following tasks and verifiers that were implemented in our library [genesys](https://github.com/PrimeIntellect-ai/genesys):
### **Mathematics Problems (777k samples):**
- Tasks: Competition-level math problems from [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT), with LLM-based post-processing to turn multiple-choice questions into free-form questions and to filter out questions without automatically verifiable responses (e.g. questions asking for proofs).
- Verifier: Symbolic verification based on the [math-verify](https://github.com/huggingface/Math-Verify) library
- Task Dataset: [PrimeIntellect/verifiable-math-problems](https://huggingface.co/datasets/PrimeIntellect/verifiable-math-problems)
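To make the idea of automatic answer verification concrete, here is a minimal sketch. It is not the math-verify library and does not reproduce its API: it is a standard-library stand-in that only handles plain rational answers, and the function names `normalize_answer` and `answers_match` are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def normalize_answer(text: str) -> Optional[Fraction]:
    """Parse a plain numeric answer such as '0.5', '1/2' or '$2$' into an exact fraction."""
    # Assumption: answers are bare numbers, optionally wrapped in '$' delimiters.
    cleaned = text.strip().strip("$").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        # Anything that is not a plain rational number is treated as unverifiable.
        return None


def answers_match(gold: str, predicted: str) -> bool:
    """Return True only when both answers parse and denote the same rational number."""
    gold_value = normalize_answer(gold)
    predicted_value = normalize_answer(predicted)
    if gold_value is None or predicted_value is None:
        return False
    return gold_value == predicted_value
```

Comparing exact values rather than strings lets `1/2` and `0.5` count as the same answer, which is the basic motivation for symbolic rather than textual checking. A real symbolic verifier additionally handles expressions, sets, intervals and LaTeX, which this sketch does not attempt.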
### **Algorithmic Coding Problems (144k samples):**
- Tasks: Algorithmic challenges from coding competitions and platforms such as LeetCode, curated from the [Apps](https://huggingface.co/datasets/codeparrot/apps), [Codecontests](https://huggingface.co/datasets/deepmind/code_contests), [Codeforces](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions) and [TACO](https://huggingface.co/datasets/BAAI/TACO) datasets. LLM-based post-processing was applied to additionally translate Python problems into JavaScript, Rust and C++ problems.
- Verifier: Containerized execution of unit tests
- Task Dataset: [PrimeIntellect/verifiable-coding-problems](https://huggingface.co/datasets/PrimeIntellect/verifiable-coding-problems)
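The unit-test verification described above can be sketched as follows. This is an illustration under stated assumptions, not the genesys implementation: it runs the candidate program in a plain subprocess rather than a container, it assumes tests are stdin/stdout pairs, and the name `passes_tests` is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_cases: list, timeout_s: float = 5.0) -> bool:
    """Run a Python solution on (stdin, expected_stdout) pairs; True only if all pass."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(solution_code, encoding="utf-8")
        for stdin_text, expected_stdout in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, str(script)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                    cwd=workdir,
                )
            except subprocess.TimeoutExpired:
                # A solution that exceeds the time limit fails the test.
                return False
            if result.returncode != 0:
                # Crashes and non-zero exits count as failures.
                return False
            # Compare outputs while ignoring leading and trailing whitespace.
            if result.stdout.strip() != expected_stdout.strip():
                return False
    return True
```

A subprocess offers no real isolation from untrusted code; containerized execution, as used for the dataset, exists precisely to provide that isolation along with resource limits.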
### **Real-World Software Engineering Problems (70k samples):**
- **Tasks:** Derived from real-world GitHub commits in the [CommitPack](https://huggingface.co/datasets/bigcode/commitpackft) dataset. Each problem pairs a pre-commit code file with an LLM-generated modification instruction, crafted using context from the original commit message and the post-commit file state.
- **Verifier:** An LLM judge compares LLM-generated code against the actual post-commit file state.
- **Task Dataset:** [PrimeIntellect/real-world-swe-problems](https://huggingface.co/datasets/PrimeIntellect/real-world-swe-problems)
### **Open-Ended STEM Question Answering (313k samples):**
- **Tasks:** Questions curated from a broad range of technical and scientific topics using the [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences). LLM-based filtering retains only those questions with objectively correct responses, excluding opinion-based queries, and only keeps questions that require genuine reasoning rather than simple recall or memorization of information.
- **Verifier:** An LLM judge scores responses by comparing them to the most upvoted answer.
- **Task Dataset:** [PrimeIntellect/stackexchange-question-answering](https://huggingface.co/datasets/PrimeIntellect/stackexchange-question-answering)
### **Synthetic Code Understanding Tasks (61k samples):**
- **Tasks:** Fully synthetic task where the goal is to predict the output of code that performs string transformations given the code and some string input. We generate arbitrary string-processing functions via LLM prompting and recursively increase their complexity using a scheme akin to [evol-instruct](https://arxiv.org/pdf/2304.12244). Inputs include both random strings and snippets from news articles, with ground truth outputs obtained by executing the generated code.
- **Verifier:** LLM-predicted output strings are directly compared with real output strings and are judged as correct when an exact match occurs.
- **Task Dataset:** [PrimeIntellect/synthetic-code-understanding](https://huggingface.co/datasets/PrimeIntellect/synthetic-code-understanding)
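The task construction and exact-match check can be illustrated with a small sketch. Everything here is hypothetical: `transform` stands in for an LLM-generated string-processing function, and `make_task` and `is_correct` are illustrative names, not part of the dataset's tooling.

```python
import random
import string


def transform(text: str) -> str:
    """Hypothetical string-processing function: reverse the text and swap letter case."""
    return text[::-1].swapcase()


def make_task(seed: int, length: int = 12) -> dict:
    """Build one task: a random input string plus the ground truth from running the code."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    input_text = "".join(rng.choice(alphabet) for _ in range(length))
    # Ground truth comes from executing the function, not from a model.
    return {"input": input_text, "expected_output": transform(input_text)}


def is_correct(predicted_output: str, expected_output: str) -> bool:
    """Exact-match verification: any difference, even whitespace, counts as wrong."""
    return predicted_output == expected_output
```

For example, `transform("Hello, World")` returns `"DLROw ,OLLEh"`, and a prediction is accepted only if it reproduces that string character for character.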
## Citation
Feel free to cite SYNTHETIC-1 if you have found it useful for your work:
```bib
@misc{2025synthetic1,
title={SYNTHETIC-1: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1},
author={Justus Mattern and Sami Jaghouar and Manveer Basra and Jannik Straube and Matthew Di Ferrante and Felix Gabriel and Jack Min Ong and Vincent Weisser and Johannes Hagemann},
year={2025},
url={https://www.primeintellect.ai/blog/synthetic-1-release},
}
```
Provider: maas
Created: 2025-05-13



