togethercomputer/CoderForge-Preview
Source: Hugging Face. Updated 2026-02-26; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/togethercomputer/CoderForge-Preview
Resource description:
---
dataset_info:
- config_name: trajectories
  features:
  - name: trajectory_id
    dtype: string
  - name: finish_reason
    dtype: string
  - name: image
    dtype: string
  - name: messages
    dtype: string
  - name: reward
    dtype: float64
  - name: tools
    dtype: string
  - name: license
    dtype: string
  splits:
  - name: SWE_Rebench
    num_bytes: 19392208677
    num_examples: 77169
  - name: SWE_Smith
    num_bytes: 33088967556
    num_examples: 148001
  - name: R2E_Gym
    num_bytes: 6869123922
    num_examples: 32964
  - name: filtered_reward1
    num_bytes: 33547502194
    num_examples: 155144
  download_size: 22788997561
  dataset_size: 92897802349
- config_name: trajectories-tokenized_qwencoder
  features:
  - name: trajectory_id
    dtype: string
  - name: reward
    dtype: float64
  - name: chat_template_applied
    dtype: string
  - name: input_ids
    list: int32
  - name: labels
    list: int64
  splits:
  - name: SWE_Rebench
    num_bytes: 64238782798
    num_examples: 77169
  - name: SWE_Smith
    num_bytes: 107118447512
    num_examples: 148001
  - name: R2E_Gym
    num_bytes: 23869485518
    num_examples: 32964
  - name: filtered_reward1
    num_bytes: 108349044091
    num_examples: 155144
  download_size: 49985669802
  dataset_size: 303575759919
configs:
- config_name: trajectories
  data_files:
  - split: SWE_Rebench
    path: trajectories/SWE_Rebench-*
  - split: SWE_Smith
    path: trajectories/SWE_Smith-*
  - split: R2E_Gym
    path: trajectories/R2E_Gym-*
  - split: filtered_reward1
    path: trajectories/filtered_reward1-*
- config_name: trajectories-tokenized_qwencoder
  data_files:
  - split: SWE_Rebench
    path: trajectories-tokenized_qwencoder/SWE_Rebench-*
  - split: SWE_Smith
    path: trajectories-tokenized_qwencoder/SWE_Smith-*
  - split: R2E_Gym
    path: trajectories-tokenized_qwencoder/R2E_Gym-*
  - split: filtered_reward1
    path: trajectories-tokenized_qwencoder/filtered_reward1-*
---
# CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents
**CoderForge-Preview** is the **largest open test-verified coding agent dataset**.
Fine-tuning Qwen-3 32B on it raises **SWE-Bench Verified** performance from **23.0% to 59.4% pass@1**, ranking **#1 among open-data** and **#2 among open-weight models ≤32B parameters**.
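
The dataset metadata lists the `trajectories` config at roughly 22.8 GB to download, so streaming a single split is a convenient way to take a first look. The following is a minimal sketch using the Hugging Face `datasets` library, not an official loader. It assumes, without confirmation from this card, that the string-typed `messages` and `tools` columns hold JSON-encoded data; `decode_row` and `iter_split` are hypothetical helper names introduced only for illustration.

```python
import json
from typing import Any, Dict, Iterator

REPO_ID = "togethercomputer/CoderForge-Preview"


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with JSON-looking string columns parsed.

    Assumption: `messages` and `tools` are JSON-encoded strings. A value
    that does not parse as JSON is left as the raw string.
    """
    decoded = dict(row)
    for column in ("messages", "tools"):
        value = decoded.get(column)
        if isinstance(value, str):
            try:
                decoded[column] = json.loads(value)
            except json.JSONDecodeError:
                pass  # keep the raw string if the assumption does not hold
    return decoded


def iter_split(
    config: str = "trajectories", split: str = "R2E_Gym"
) -> Iterator[Dict[str, Any]]:
    """Stream one split lazily instead of downloading the whole config."""
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, config, split=split, streaming=True)
    for row in stream:
        yield decode_row(row)
```

With `datasets` installed and network access available, `next(iter_split())` would yield the first row of the `R2E_Gym` split, whose `trajectory_id`, `finish_reason`, and `reward` fields can then be inspected directly.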


## Limitations
- **Adaptability to different scaffolds:** We generated all trajectories using a **single scaffold** and **fixed tool set** (no permutations). Models trained via SFT on this data may perform worse when deployed with **different scaffolds, tools, prompt templates, or tool-call formats**.
- **Task scope:** Our data sources skew toward **bug fixing**. As a result, models trained on this dataset may be less capable on tasks outside that scope, such as **feature implementation**, **refactors**, or **design-heavy changes**.
- **User interaction:** Real coding-agent usage often involves **ongoing user collaboration**, with user messages appearing throughout the trajectory, not just at the start. This kind of interactive supervision is still largely missing from open coding-agent datasets (including ours). Models trained with SFT alone may therefore underperform in **interactive settings**.
## Conclusion
In this release, we focus on **large-scale agentic data generation**: assembling **51K distinct open-source tasks** and generating **long-horizon, multi-step SFT trajectories**. Our results show that a simple data-generation pipeline combined with **pure SFT** can produce substantial gains in coding-agent performance.
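
For readers who want to try pure SFT on the `trajectories-tokenized_qwencoder` config, it can help to check how much of each trajectory actually contributes to the loss. The sketch below assumes, and this is not stated in the card, that `labels` has the same length as `input_ids` and that unsupervised positions carry the conventional ignore value `-100`. `supervised_token_stats` is a hypothetical helper name.

```python
from typing import Dict, Sequence

# Assumption: the common "ignore" label id used by many SFT pipelines.
# The dataset card does not confirm which value (if any) is used.
IGNORE_INDEX = -100


def supervised_token_stats(
    input_ids: Sequence[int],
    labels: Sequence[int],
    ignore_index: int = IGNORE_INDEX,
) -> Dict[str, float]:
    """Count how many tokens of one tokenized trajectory are supervised."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    total = len(input_ids)
    # A position is supervised when its label is not the ignore value.
    supervised = sum(1 for label in labels if label != ignore_index)
    fraction = supervised / total if total else 0.0
    return {
        "total_tokens": total,
        "supervised_tokens": supervised,
        "supervised_fraction": fraction,
    }
```

Applied to a row's `input_ids` and `labels`, this returns the total token count, the supervised token count, and their ratio, which is a quick sanity check before committing to a long training run.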
### Next steps
Moving forward, we plan to:
- **Scale data generation further** (more tasks, more trajectories, longer horizons where helpful)
- Generate data under **multiple scaffolds**, **tool sets**, and **prompt/tool-call permutations** to improve robustness and transfer
- Train **larger models** and run more systematic **hyperparameter tuning**
- Follow the **DeepSWE** training paradigm by applying **agentic reinforcement learning** on top of our fine-tuned model to drive further performance gains
## Citation
```bibtex
@misc{CoderForge2026,
title = {CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents},
author = {Ariyak, Alpay and Zhang, Junda and Wang, Junxiong and Zhu, Shang and Bianchi, Federico and Srivastava, Sanjana and Panda, Ashwinee and Bharti, Siddhant and Xu, Chenfeng and Heo, John and Wu, Xiaoxia Shirley and Zhou, James and Liang, Percy and Song, Leon and Zhang, Ce and Athiwaratkun, Ben and Zhou, Zhongzhu and Wu, Qingyang},
year = {2026},
month = feb,
publisher = {TogetherAI Blog},
url = {https://www.together.ai/blog/coderforge-preview},
note = {Project core leads: Alpay Ariyak; Zhongzhu Zhou; Qingyang Wu}
}
```
Provider:
togethercomputer



