
SWE-Bench-plus-plus

ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/TuringEnterprises/SWE-Bench-plus-plus
# SWE-bench++

## 1. Summary

<div style="margin-left: 20px;">
<b>Repository:</b> <a href="https://github.com/TuringEnterprises/SWE-Bench-plus-plus">TuringEnterprises/SWE-Bench-plus-plus</a><br>
<b>Evaluate Models Using:</b> <code>swebench.harness.run_evaluation</code> (see <i>Evaluation Guide</i> below)
</div>

In the domain of software engineering, LLM capabilities have progressed rapidly, underscoring the need for evaluation frameworks that evolve with them. Benchmarks such as SWE-bench, SWE-bench Verified, and their variants are foundational but incomplete: their manually curated design causes scalability bottlenecks, weak test oracles, dataset aging and contamination, reproducibility challenges, and more.

In response, **Turing** introduces **SWE-bench++**: a reenvisioned, innovative, end-to-end evaluation framework. It addresses existing evaluation pain points and introduces new capabilities, positioning it as a forerunner for software reasoning evaluation and training. Our initial private validation benchmark consists of **7,000+ GitHub instances** from **thousands of repositories** across **9 languages**.

We’ve made 500 of these instances publicly available, with over 80% in the medium-to-hard difficulty range. These tasks average **120+ lines of code edited** (with a considerable number exceeding 1,000 lines) and **7+ files edited**. The highest-scoring models are **gpt-5-2025-08-07 at 26.8%, claude-sonnet-4.5 at 26%, gpt5-high-reasoning at 22.7%, and claude-opus-4.1 at 22.5%**, with the next-best models dropping below **14%**. See below for more details.

SWE-bench++ is far more extensive than many previously released benchmarks because its design automates nearly the entire creation pipeline. The pipeline is distinctive in that it scales and generalizes, especially to the evaluation of other, more holistic software engineering tasks.

---

## 2. Getting Started

### Evaluating Models on SWE-bench++

To evaluate your model's performance on this dataset, visit our [evaluation repository](https://github.com/TuringEnterprises/SWE-Bench-plus-plus).

This repository provides:
- Complete evaluation harness with Docker-based testing environments
- Step-by-step setup instructions
- Example prediction file formats
- Comprehensive troubleshooting guides

### Quick Start

```bash
# Install the evaluation framework
git clone https://github.com/TuringEnterprises/SWE-Bench-plus-plus.git
cd SWE-Bench-plus-plus/SWE-Bench

python3 -m venv .venv
source .venv/bin/activate

pip install -e .

# Run evaluation using this Hugging Face dataset
python -m swebench.harness.run_evaluation \
    --dataset_name TuringEnterprises/SWE-Bench-plus-plus \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id> \
    --turing_eval
```

For detailed instructions, please refer to the [Evaluation Guide](https://github.com/TuringEnterprises/SWE-Bench-plus-plus#evaluation-guide) in the repository.

---

## 3. Benchmark Construction (Methodology)

We follow the framework below in our benchmark construction and evaluation pipeline.

<p align="center">
  <img src="assets/swe_framework.png" alt="SWE-bench++ Framework" width="700"/>
  <i>Figure 1: SWE-bench++ Framework</i>
</p>

SWE-bench++ introduces **6 key innovations** that enable this:

1. **Scalable sourcing and filtering (capture tasks):** We use heuristics to broadly select pull requests (PRs) that match our quality thresholds: active maintenance with recent commit activity, more than 100 stars and a recognizable testing framework, up to 10k lines of code changes, and merged PRs that explicitly close an issue.
2. **Intelligent data curation (refine tasks):** We combine agent verification with human-expert verification to ensure high-quality PRs (problems that are specific enough, hard enough, and able to be containerized).
3. **Agentic Dockerization (Dockerize tasks):** We combine two strategies to package each PR: template-based scaffolding and LLM-based containerizing (we generate Dockerfile templates for each programming language and use an agent to fill in the blanks).
4. **LLM-powered quality control (validate tasks):** We employ an agent once more as the final data validation step to check for issues that can slip through a successful Docker build (e.g., redundant steps or inaccurate test commands).
5. **Diagnostic feedback (analyze failures):** We eliminate manual engineering and debugging by using 3 states (base, before, and after) to analyze test outcomes, together with hybrid log parsing to extract test results from execution logs. Our hybrid log parser employs both a standard parser and an LLM-generated one to enable model debugging with unstructured test outputs.
6. **Automated trajectory curation for fine-tuning (turn insights into training data):** We curate agentic trajectories as the model reasons through tasks in our dataset. These trajectories serve as valuable demonstrations for fine-tuning and enable hill climbing of other SWE benchmarks.

---

## 4. Results

To validate the complexity of this new dataset, we benchmarked SOTA LLM agents using **swe-agent** on 500 instances and measured **pass@1**. The wide performance gap, with pass@1 scores ranging from **26.8% down to 1%**, confirms both that the dataset is challenging and that there is a clear model hierarchy.

<p align="center">
  <img src="assets/resolve_rate.png" alt="SWE-bench++ Evaluation Results: Resolve Rate by Model" width="850"/>
  <i>Figure 2: SWE-bench++ Evaluation Results (Resolve Rate by Model)</i>
</p>

---

## 5. Metadata

### Overview

**SWE-bench++ (Public)** is the community-accessible release of our extended SWE-bench benchmark. It includes 500 high-quality tasks designed to evaluate the ability of LLMs and coding agents to resolve real-world GitHub issues and pull requests.
This dataset prioritizes both quantity and quality of tasks, having captured, scraped, and packaged diverse, difficult, high-quality PRs.

### Key Features

- **Task Scale:** 500 tasks across diverse repos and languages.
- **Multilinguality:** 7 programming languages
- **Repository Coverage:** 11 repo types
- **Issue Coverage:** 6 issue types
- **No Copyright Issues**

We outline these distributions below.

---

<p align="center">
  <img src="assets/prog_language_distr.png" alt="Task Distribution of Coding Languages" width="700"/>
  <i>Figure 3: SWE-bench++ Task Distribution of Coding Languages</i>
</p>

<br>

<p align="center">
  <img src="assets/issue_type_distr.png" alt="Issue Type Distribution Across SWE-bench++ Tasks" width="700"/>
  <i>Figure 4: Issue Type Distribution Across SWE-bench++ Tasks</i>
</p>

<br>

<p align="center">
  <img src="assets/repo_type_distr.png" alt="Repository Type Distribution" width="700"/>
  <i>Figure 5: Repository Type Distribution</i>
</p>

Our heuristic-based sourcing step, which is intentionally coarse and fast, lets us collect a large number of PRs (our initial run collected over 50,000). This volume allows us to retain high repository coverage even as we prune for quality.

<br>

<p align="center">
  <img src="assets/difficulty_distr.png" alt="Difficulty Level Distribution" width="700"/>
  <i>Figure 6: Task Difficulty Level Distribution</i>
</p>

We categorize difficulty level based on the number of lines of code edited and the number of files edited. The thresholds below are placeholders pending official numbers, and the tiers are checked from hardest to easiest so that the first match wins:

```
if   # lines of code edited > [x1] and # files edited > [x2]: task = hard
elif # lines of code edited > [y1] and # files edited > [y2]: task = medium
elif # lines of code edited > [z1] and # files edited > [z2]: task = easy
```

This distribution demonstrates the overall difficulty of this dataset, with over 80% of tasks being of medium difficulty or above. See more metadata, including lines of code edited, files edited, and license counts, in the appendix.

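The difficulty rule above can be sketched in Python as follows. This is a minimal illustration rather than the benchmark's implementation: the `Threshold` type, the `classify_difficulty` function, the `"unclassified"` fallback, and the `EXAMPLE_*` values are all assumptions introduced here, since the official thresholds are still placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Edit sizes a task must strictly exceed to reach a difficulty tier."""
    min_lines: int
    min_files: int


def classify_difficulty(lines_edited: int, files_edited: int,
                        hard: Threshold, medium: Threshold, easy: Threshold) -> str:
    """Return the first tier, checked hardest-first, whose two thresholds are both exceeded."""
    tiers = (("hard", hard), ("medium", medium), ("easy", easy))
    for label, limit in tiers:
        if lines_edited > limit.min_lines and files_edited > limit.min_files:
            return label
    # Hypothetical fallback: the source does not say what happens when no tier matches.
    return "unclassified"


# Illustrative values only. These are NOT the benchmark's thresholds,
# which the source leaves as placeholders ([x1], [x2], ...).
EXAMPLE_HARD = Threshold(min_lines=100, min_files=4)
EXAMPLE_MEDIUM = Threshold(min_lines=20, min_files=1)
EXAMPLE_EASY = Threshold(min_lines=0, min_files=0)
```

Checking tiers from hardest to easiest and returning on the first match keeps a task that clears the hard thresholds from being relabeled by a looser rule further down.
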

---

## 6. Implications and Conclusion

The path to ASI resembles a three-legged race between model improvement and human evaluation: models get better, benchmarks adjust, and the cycle repeats. Essentially, models can only be systematically improved when benchmarks are rigorous enough to surface their limitations, creating a feedback loop where better models demand better benchmarks, and vice versa. Each side is dependent on the other to push forward. On the "benchmark side," SWE-bench++ gives the push ahead needed to stabilize the team.

This framework both generalizes to other software engineering tasks (including those that may have non-standard build procedures or dependencies on external hardware) and paves the way for model hill-climbing and future research advancements (e.g., realistic, evolving RL gyms).

SWE-bench++ sets a new standard for evaluating and training software reasoning capabilities, with its core innovations addressing leaderboard overfitting and enabling the development of models that can more robustly **reason**, **self-correct**, and **plan**.

---

## 7. Licensing and Permissions

Turing Enterprises, Inc. grants you a worldwide, royalty-free, non-exclusive, non-transferable, and revocable limited license to access, use, reproduce, and create derivative works of the **Dataset** solely for **non-commercial research, academic, or educational purposes**.

This license is only intended to facilitate experimentation, benchmarking, and study of the dataset.

You **may NOT** use the Dataset or any derivative works for commercial purposes.

If interested in commercial use, please contact <a href="mailto:yuzhao.ni@turing.com?subject=Extended SWE-bench Commercial Access" style="font-weight: bold;">yuzhao.ni@turing.com</a>.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. IN NO EVENT SHALL TURING BE LIABLE FOR ANY DIRECT OR INDIRECT CLAIMS, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.

---

## 8. Appendix

We include more task metadata below, emphasizing this dataset’s wide coverage.

<p align="center">
  <img src="assets/loc_edited_distr.png" alt="Lines of Code Edited Distribution" width="700"/>
  <i>Figure 7: Lines of Code Edited Distribution (From PR)</i>
</p>

<p align="center">
  <img src="assets/files_edited_distr.png" alt="Number of Files Edited Distribution" width="700"/>
  <i>Figure 8: Number of Files Edited (From PR) Distribution</i>
</p>

<p align="center">
  <img src="assets/license_distr.png" alt="Licenses" width="700"/>
  <i>Figure 9: Count of Licenses (From Repo)</i>
</p>

---

**SWE-bench++ Overview**

1. Summary
2. Getting Started
3. Benchmark Construction (Methodology)
4. Results
5. Metadata
6. Implications and Conclusion
7. Licensing and Permissions
8. Appendix
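
As a companion to the Quick Start command in Section 2, the sketch below builds a predictions file in JSON Lines form. It assumes the record layout used by the upstream SWE-bench harness (`instance_id`, `model_name_or_path`, `model_patch`); whether the SWE-bench++ fork expects exactly these keys should be checked against the example prediction files in the evaluation repository. The helper names are hypothetical and introduced only for illustration.

```python
import json
from pathlib import Path


def build_record(instance_id: str, model_patch: str, model_name: str) -> dict:
    """Build one prediction record (key names assumed from upstream SWE-bench)."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": model_patch,
    }


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def write_predictions(records, path) -> Path:
    """Write the records to `path` and return it as a Path."""
    out = Path(path)
    out.write_text(to_jsonl(records), encoding="utf-8")
    return out
```

A call such as `write_predictions([build_record(some_id, some_diff, "my-model")], "predictions.jsonl")` would produce a file that can be passed to `--predictions_path`, provided the assumed keys match what the harness reads.
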

Provider: maas
Created: 2025-12-04