SWE-bench-extra
ModelScope Community · updated 2026-04-16 · indexed 2024-12-28
Download link: https://modelscope.cn/datasets/AI-ModelScope/SWE-bench-extra
*Note: This dataset has an improved and significantly larger successor: [SWE-rebench](https://huggingface.co/datasets/nebius/SWE-rebench).*
# Dataset Summary
SWE-bench Extra is a dataset for training or evaluating agentic systems that specialize in resolving GitHub issues. It follows the methodology used to build the SWE-bench benchmark and contains 6,415 issue/pull request pairs from 1,988 Python repositories.
# Dataset Description
The SWE-bench Extra dataset supports the development of software engineering agents capable of autonomously solving GitHub issues. The data collection process, based on the SWE-bench methodology, involves the following steps:
1. **Issue and Pull Request Collection**: Issues are gathered and linked with pull requests that successfully resolve them.
2. **Filtering**: Instances are filtered based on attributes such as issue descriptions, relevant code paths, and test patches.
3. **Execution-based Validation**: Each project's environment is set up and its tests are run to confirm that they execute correctly.
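The before/after test conditions that the `FAIL_TO_PASS` and `PASS_TO_PASS` fields encode (see Dataset Structure) can be expressed as a small check. This is a minimal sketch, assuming test outcomes have already been parsed into dictionaries that map test names to the hypothetical labels `"PASSED"`/`"FAILED"`; the requirement of at least one fail-to-pass test follows the original SWE-bench methodology. It illustrates the idea and is not the authors' validation code.

```python
from typing import Mapping, Sequence

# Hypothetical outcome labels; real test logs must be parsed into this form first.
PASSED, FAILED = "PASSED", "FAILED"


def satisfies_test_criteria(
    before: Mapping[str, str],
    after: Mapping[str, str],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Check one instance's before/after test outcomes.

    before: test name -> outcome with only the test patch applied.
    after:  test name -> outcome with the test patch and the gold patch applied.
    """
    if not fail_to_pass:
        # Without at least one issue-tied test, the fix cannot be verified.
        return False
    for test in fail_to_pass:
        # Must not pass before the fix, and must pass after it.
        if before.get(test) == PASSED or after.get(test) != PASSED:
            return False
    for test in pass_to_pass:
        # Must pass both before and after the fix (a missing result counts as a failure).
        if before.get(test) != PASSED or after.get(test) != PASSED:
            return False
    return True
```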
For a more detailed description of the data collection process, please refer to our blog post [Scaling data collection for training software engineering agents](https://nebius.com/blog/posts/scaling-data-collection-for-training-swe-agents).
As an example use case, we used SWE-bench-extra instances to generate a dataset of 80,036 trajectories, [`nebius/swe-agent-trajectories`](https://huggingface.co/datasets/nebius/swe-agent-trajectories). We then trained an action generator model that scores 19.2% on a subset of 50 randomly selected instances from the SWE-bench Verified benchmark, a 30% relative improvement over its parent model, Qwen2.5-72B-Instruct, which scored 14.8%. Augmenting the action generator with guided search driven by a critic model, also trained on this data, raises the score to 40.6% on the full SWE-bench Verified benchmark, which is state-of-the-art among agents that use only open-weight models. You can read more about this agent in our blog post [“Leveraging Training and Search for Better Software Engineering Agents”](https://nebius.com/blog/posts/training-and-search-for-software-engineering-agents).
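The 30% figure is a relative gain; the absolute gain on the 50-instance subset is 4.4 percentage points:

```latex
\text{relative improvement} = \frac{19.2 - 14.8}{14.8} = \frac{4.4}{14.8} \approx 0.297 \approx 30\%
```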
# How to Use
```python
from datasets import load_dataset
ds = load_dataset('nebius/SWE-bench-extra')
```
# Dataset Statistics
Mean, 75th percentile (p75), and maximum values for various attributes of the collected instances. Statistics are micro-averaged over all instances, without grouping by repository.
| Attribute | Metric | Mean | p75 | Max |
|---------------|--------------------|----------|----------|-----------|
| Issue text | Length (words) | 111.5 | 146 | 1,294 |
| Code base | Files (Non-test) | 71.71 | 72.00 | 2,264 |
| | Lines (Non-test) | 15,163.38| 13,777 | 1,039,288 |
| Gold patch | Files edited | 2.6 | 3 | 7 |
| | Lines edited | 56 | 76 | 300 |
| Tests | Fail to Pass | 10.94 | 5 | 4,941 |
| | Total | 58.5 | 49 | 7,820 |
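Micro-averaging means every instance counts once, regardless of which repository it comes from, so repositories with many instances weigh more. A minimal sketch of the difference, using hypothetical per-instance records with a `repo` key (matching the dataset field) and an illustrative metric name. The percentile method (linear interpolation, equivalent to NumPy's default) is an assumption, since the card does not state which method was used.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean, p75, and max over pooled values (needs at least two values)."""
    # method="inclusive" interpolates linearly between order statistics.
    p75 = statistics.quantiles(values, n=4, method="inclusive")[2]
    return {"mean": statistics.fmean(values), "p75": p75, "max": max(values)}


def micro_mean(records: List[dict], key: str) -> float:
    # Pool all instances: each instance contributes equally.
    return statistics.fmean(r[key] for r in records)


def macro_mean(records: List[dict], key: str) -> float:
    # For contrast only: average within each repository, then across repositories.
    by_repo = defaultdict(list)
    for r in records:
        by_repo[r["repo"]].append(r[key])
    return statistics.fmean(statistics.fmean(v) for v in by_repo.values())
```

With two instances from one repository and one from another, the two averages diverge, which is why the grouping choice matters when reading the table.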
# Dataset Structure
The dataset contains the following fields. It includes all fields from SWE-bench and adds a `meta` column, which indicates whether the instance meets the "lite" criteria and, if not, lists the failed validators.
| Field name | Type | Description |
|----------------------------|--------|-------------------------------------------------------------------------------------------------|
| `instance_id`              | str    | A formatted instance identifier, usually of the form `repo_owner__repo_name-PR-number`.       |
| `patch`                    | str    | The gold patch: the patch from the PR (excluding test-related code) that resolved the issue.   |
| `repo` | str | The repository owner/name identifier from GitHub. |
| `base_commit` | str | The commit hash of the repository representing the HEAD of the repository before the solution PR is applied. |
| `hints_text`               | str    | Comments made on the issue before the creation date of the solution PR’s first commit.         |
| `created_at` | str | The creation date of the pull request. |
| `test_patch` | str | A test-file patch that was contributed by the solution PR. |
| `problem_statement` | str | The issue title and body. |
| `version` | str | Installation version to use for running evaluation. |
| `environment_setup_commit` | str | Commit hash to use for environment setup and installation. |
| `FAIL_TO_PASS` | str | A JSON list of strings that represent the set of tests resolved by the PR and tied to the issue resolution. |
| `PASS_TO_PASS` | str | A JSON list of strings that represent tests that should pass before and after the PR application. |
| `meta` | str | A JSON dictionary indicating whether the instance is lite, along with a list of failed lite validators if it is not. |
| `license` | str | The type of license of the repository. |
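Because `FAIL_TO_PASS`, `PASS_TO_PASS`, and `meta` are stored as JSON-encoded strings, they need decoding before use. A minimal plain-Python sketch that decodes them and splits `instance_id` according to the usual `repo_owner__repo_name-PR-number` pattern. The key name `is_lite` inside `meta` is an assumption (the card does not list the keys), so it is exposed as a parameter; verify it against the actual data.

```python
import json
import re
from typing import Any, Dict, Optional, Tuple

# GitHub owner names cannot contain underscores, so the first "__" separates
# the owner from the repository name; the trailing "-<digits>" is the PR number.
_INSTANCE_ID = re.compile(r"^(?P<owner>.+?)__(?P<name>.+)-(?P<pr>\d+)$")

# Fields documented in the table above as JSON-encoded strings.
JSON_FIELDS = ("FAIL_TO_PASS", "PASS_TO_PASS", "meta")


def parse_instance_id(instance_id: str) -> Optional[Tuple[str, str, int]]:
    """Split 'owner__name-123' into ('owner', 'name', 123), or return None."""
    match = _INSTANCE_ID.match(instance_id)
    if match is None:
        return None  # the format is only "usually" followed
    return match["owner"], match["name"], int(match["pr"])


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of `row` with its JSON string fields decoded."""
    decoded = dict(row)
    for field in JSON_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


def is_lite(row: Dict[str, Any], key: str = "is_lite") -> bool:
    """Read the lite flag from `meta`. ASSUMPTION: the key name is a guess."""
    meta = decode_row(row).get("meta") or {}
    return bool(meta.get(key, False))
```

Rows returned by `datasets` behave like dictionaries, so `decode_row(ds["train"][0])` should work, assuming the dataset exposes a split named `train`.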
To execute these instances with SWE-bench, you need to provide a default recipe for installing dependencies. The constants required to run them are defined in [constants.py](https://huggingface.co/datasets/nebius/SWE-bench-extra/blob/main/constants.py).
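The card does not show what the default recipe looks like. As a hypothetical illustration only, an environment-setup step might use a per-repository, per-version spec when one exists and fall back to the default recipe otherwise. Every key and value in `DEFAULT_SPEC`, the nested `specs` layout, and the name `resolve_spec` are assumptions; consult constants.py for the real constants.

```python
from typing import Any, Dict, Mapping

# HYPOTHETICAL default recipe; the real values are defined in constants.py.
DEFAULT_SPEC: Dict[str, Any] = {
    "python": "3.9",
    "install": "pip install -e .",
    "test_cmd": "pytest -rA",
}


def resolve_spec(
    repo: str,
    version: str,
    specs: Mapping[str, Mapping[str, Dict[str, Any]]],
    default: Dict[str, Any] = DEFAULT_SPEC,
) -> Dict[str, Any]:
    """Return the repo/version spec if one exists, else the default recipe.

    A copy is returned so callers cannot mutate the shared default.
    """
    return dict(specs.get(repo, {}).get(version, default))
```

Falling back to a single recipe keeps setup manageable across nearly 2,000 repositories, at the cost of failing for projects with unusual build steps.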
# License
The dataset is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0) license. However, please respect the license of each repository on which an instance is based. To facilitate this, every instance records the repository's license at the time of the commit.
Provider: maas
Created: 2024-12-22