togethercomputer/CoderForge-Preview

Name: togethercomputer/CoderForge-Preview
Creator: togethercomputer
Published: 2026-02-26 18:22:08
License: No description available

Source: Hugging Face. Updated 2026-02-26; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/togethercomputer/CoderForge-Preview


Resource description:

---
dataset_info:
- config_name: trajectories
  features:
  - name: trajectory_id
    dtype: string
  - name: finish_reason
    dtype: string
  - name: image
    dtype: string
  - name: messages
    dtype: string
  - name: reward
    dtype: float64
  - name: tools
    dtype: string
  - name: license
    dtype: string
  splits:
  - name: SWE_Rebench
    num_bytes: 19392208677
    num_examples: 77169
  - name: SWE_Smith
    num_bytes: 33088967556
    num_examples: 148001
  - name: R2E_Gym
    num_bytes: 6869123922
    num_examples: 32964
  - name: filtered_reward1
    num_bytes: 33547502194
    num_examples: 155144
  download_size: 22788997561
  dataset_size: 92897802349
- config_name: trajectories-tokenized_qwencoder
  features:
  - name: trajectory_id
    dtype: string
  - name: reward
    dtype: float64
  - name: chat_template_applied
    dtype: string
  - name: input_ids
    list: int32
  - name: labels
    list: int64
  splits:
  - name: SWE_Rebench
    num_bytes: 64238782798
    num_examples: 77169
  - name: SWE_Smith
    num_bytes: 107118447512
    num_examples: 148001
  - name: R2E_Gym
    num_bytes: 23869485518
    num_examples: 32964
  - name: filtered_reward1
    num_bytes: 108349044091
    num_examples: 155144
  download_size: 49985669802
  dataset_size: 303575759919
configs:
- config_name: trajectories
  data_files:
  - split: SWE_Rebench
    path: trajectories/SWE_Rebench-*
  - split: SWE_Smith
    path: trajectories/SWE_Smith-*
  - split: R2E_Gym
    path: trajectories/R2E_Gym-*
  - split: filtered_reward1
    path: trajectories/filtered_reward1-*
- config_name: trajectories-tokenized_qwencoder
  data_files:
  - split: SWE_Rebench
    path: trajectories-tokenized_qwencoder/SWE_Rebench-*
  - split: SWE_Smith
    path: trajectories-tokenized_qwencoder/SWE_Smith-*
  - split: R2E_Gym
    path: trajectories-tokenized_qwencoder/R2E_Gym-*
  - split: filtered_reward1
    path: trajectories-tokenized_qwencoder/filtered_reward1-*
---

# CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents

**CoderForge-Preview** is **the largest open test-verified coding agent dataset.** Fine-tuning Qwen-3 32B on it, we boost **SWE-Bench Verified performance** **23.0% → 59.4% pass@1** and rank **#1 among open-data** and **#2 among open-weight models ≤32B parameters.**

![top_open_data_models](https://cdn-uploads.huggingface.co/production/uploads/63972847b3e2256c9ce1307b/UG0fXsbVAMxoxxuC0kRLe.png)

![top_open_weight_models](https://cdn-uploads.huggingface.co/production/uploads/63972847b3e2256c9ce1307b/y6MKx8AQTGkeG4N8kGDqd.png)

## Limitations

- **Adaptability to different scaffolds:** We generated all trajectories using a **single scaffold** and **fixed tool set** (no permutations). Models trained via SFT on this data may perform worse when deployed with **different scaffolds, tools, prompt templates, or tool-call formats**.
- **Task scope:** Our data sources skew toward **bug fixing**. As a result, models trained on this dataset may be less capable on tasks outside that scope, such as **feature implementation**, **refactors**, or **design-heavy changes**.
- **User interaction:** Real coding-agent usage often involves **ongoing user collaboration**, with user messages appearing throughout the trajectory, not just at the start. This kind of interactive supervision is still largely missing from open coding-agent datasets (including ours). Models trained on SFT alone may therefore underperform in **interactive settings**.

## Conclusion

In this release, we focus on **large-scale agentic data generation**: assembling **51K distinct open-source tasks** and generating **long-horizon, multi-step SFT trajectories**. Our results show that a simple data-generation pipeline combined with **pure SFT** can produce substantial gains in coding-agent performance.

### Next steps

Moving forward, we plan to:

- **Scale data generation further** (more tasks, more trajectories, longer horizons where helpful)
- Generate data under **multiple scaffolds**, **tool sets**, and **prompt/tool-call permutations** to improve robustness and transfer
- Train **larger models** and run more systematic **hyperparameter tuning**
- Follow the **DeepSWE** training paradigm by applying **agentic reinforcement learning** on top of our fine-tuned model to drive further performance gains

## Citation

```bibtex
@misc{CoderForge2026,
  title = {CoderForge-Preview: SOTA Open Dataset for Training Efficient Agents},
  author = {Ariyak, Alpay and Zhang, Junda and Wang, Junxiong and Zhu, Shang and Bianchi, Federico and Srivastava, Sanjana and Panda, Ashwinee and Bharti, Siddhant and Xu, Chenfeng and Heo, John and Wu, Xiaoxia Shirley and Zhou, James and Liang, Percy and Song, Leon and Zhang, Ce and Athiwaratkun, Ben and Zhou, Zhongzhu and Wu, Qingyang},
  year = {2026},
  month = feb,
  publisher = {TogetherAI Blog},
  url = {https://www.together.ai/blog/coderforge-preview},
  note = {Project core leads: Alpay Ariyak; Zhongzhu Zhou; Qingyang Wu}
}
```
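
The `dataset_info` metadata in the card lists two configs (`trajectories` and `trajectories-tokenized_qwencoder`), each with four splits. As a rough illustration of how one might work with the `trajectories` config, here is a minimal Python sketch. Several things in it are assumptions rather than facts from the card: that the `messages` string column holds a JSON-encoded list of chat messages, that a `reward` of 1.0 marks a test-verified success (suggested only by the split name `filtered_reward1`), and the helper names themselves, which are hypothetical. The split sizes are copied from the card; the loader uses the Hugging Face `datasets` library in streaming mode and is imported lazily so the helpers work without it.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

REPO_ID = "togethercomputer/CoderForge-Preview"

# Split sizes copied from the card's `trajectories` config.
TRAJECTORY_SPLITS: Dict[str, Dict[str, int]] = {
    "SWE_Rebench": {"num_bytes": 19392208677, "num_examples": 77169},
    "SWE_Smith": {"num_bytes": 33088967556, "num_examples": 148001},
    "R2E_Gym": {"num_bytes": 6869123922, "num_examples": 32964},
    "filtered_reward1": {"num_bytes": 33547502194, "num_examples": 155144},
}


def mean_bytes_per_example(split: str) -> float:
    """Average uncompressed size of one trajectory in the given split."""
    info = TRAJECTORY_SPLITS[split]
    return info["num_bytes"] / info["num_examples"]


def decode_messages(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Parse the `messages` column.

    ASSUMPTION: the string is a JSON-encoded list of chat messages.
    The card only states that the dtype is `string`.
    """
    return json.loads(row["messages"])


def keep_successful(
    rows: Iterable[Dict[str, Any]], threshold: float = 1.0
) -> Iterator[Dict[str, Any]]:
    """Yield rows whose reward meets the threshold.

    ASSUMPTION: reward == 1.0 means the trajectory passed its tests.
    """
    for row in rows:
        if row["reward"] >= threshold:
            yield row


def stream_split(split: str = "R2E_Gym", config: str = "trajectories"):
    """Open one split lazily instead of downloading the whole config."""
    from datasets import load_dataset  # requires `pip install datasets`

    return load_dataset(REPO_ID, config, split=split, streaming=True)


if __name__ == "__main__":
    for name in TRAJECTORY_SPLITS:
        print(f"{name}: ~{mean_bytes_per_example(name) / 1e3:.0f} kB per example")
```

Streaming is used here because the card reports a `download_size` of 22788997561 bytes for `trajectories` and 49985669802 bytes for the tokenized config; iterating over `stream_split()` and passing the rows through `keep_successful` avoids fetching everything up front.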


Provider:

togethercomputer
