SWE-rebench-leaderboard
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/nebius/SWE-rebench-leaderboard
# Dataset Summary
SWE-rebench-leaderboard is a continuously updated, curated subset of the full [SWE-rebench](https://huggingface.co/datasets/nebius/SWE-rebench) corpus, tailored for benchmarking software engineering agents on real-world tasks.
These tasks are used in the [SWE-rebench leaderboard](https://swe-rebench.com/leaderboard). For more details on the benchmark methodology and data collection process, please refer to our paper [SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents](https://arxiv.org/abs/2505.20411).
All Docker images required to run the tasks are pre-built and publicly available on [Docker Hub](https://hub.docker.com/repositories/swerebench). You do not need to build them yourself. The specific image for each task is listed in the `docker_image` column.
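As a rough illustration, the image for a task can be fetched by passing its `docker_image` value to `docker pull`. The sketch below only builds the command and does not run it. The helper name `docker_pull_command` and the sample image name are hypothetical, and running the command assumes the Docker CLI is installed.

```python
import shlex


def docker_pull_command(task: dict) -> list[str]:
    """Build the `docker pull` command for one task row (hypothetical helper)."""
    image = task["docker_image"].strip()
    if not image:
        raise ValueError("task has an empty docker_image field")
    return ["docker", "pull", image]


# Hypothetical row; real rows come from the dataset's `docker_image` column.
example_task = {"docker_image": "swerebench/example-image:latest"}
command = docker_pull_command(example_task)
print(shlex.join(command))
```

To actually pull the image, the returned list can be handed to `subprocess.run(command, check=True)`.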
To get the exact subset of tasks used for a specific month's SWE-rebench-leaderboard, you can filter the dataset by the `created_at` field.
# News
[2025/09/19] Added a split for each month.
[2025/09/01] Added 52 August tasks, each with a corresponding Docker image.
[2025/08/04] Added 34 July tasks, each with a corresponding Docker image.
# How to Use
```python
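from datasets import load_dataset
# Load the leaderboard subset from the Hugging Face Hub.
ds = load_dataset('nebius/SWE-rebench-leaderboard')
# Keep only tasks whose pull request was created in June 2025.
ds_june_2025 = ds['test'].filter(lambda x: x['created_at'].startswith('2025-06'))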
```
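The same prefix match on `created_at` can be used to see how many tasks fall into each month. The following is a minimal sketch that assumes `created_at` is a string beginning with `YYYY-MM`, as the filter above implies. The helper `tasks_per_month` and the sample rows are hypothetical stand-ins for `ds['test']`.

```python
from collections import Counter


def tasks_per_month(rows) -> dict[str, int]:
    """Count tasks per `YYYY-MM` prefix of the `created_at` string."""
    counts = Counter(row["created_at"][:7] for row in rows)
    return dict(sorted(counts.items()))


# Hypothetical rows; with the real dataset, pass ds['test'] instead.
sample_rows = [
    {"created_at": "2025-06-03T10:15:00"},
    {"created_at": "2025-06-21T08:00:00"},
    {"created_at": "2025-07-02T12:30:00"},
]
print(tasks_per_month(sample_rows))  # {'2025-06': 2, '2025-07': 1}
```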
# Dataset Structure
The SWE-rebench dataset schema extends the original SWE-bench schema with additional fields to support richer analysis. The complete schema is detailed in the table below. For more information about the data and the methodology used to collect it, please refer to our paper.
| Field name | Type | Description |
|----------------------------|--------|-------------------------------------------------------------------------------------------------|
| `instance_id` | str | A formatted instance identifier, usually as `repo_owner__repo_name-PR-number`. |
| `patch` | str | The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue. |
| `repo` | str | The repository owner/name identifier from GitHub. |
| `base_commit` | str | The commit hash of the repository representing the HEAD of the repository before the solution PR is applied. |
| `hints_text` | str | Comments made on the issue before the creation date of the solution PR’s first commit. |
| `created_at` | str | The creation date of the pull request. |
| `test_patch` | str | A test-file patch that was contributed by the solution PR. |
| `problem_statement` | str | The issue title and body. |
| `version` | str | Installation version to use for running evaluation. |
| `environment_setup_commit` | str | Commit hash to use for environment setup and installation. |
| `FAIL_TO_PASS` | str | A JSON list of strings that represent the set of tests resolved by the PR and tied to the issue resolution. |
| `PASS_TO_PASS` | str | A JSON list of strings that represent tests that should pass before and after the PR application. |
| `meta` | str | A JSON dictionary indicating whether the instance is lite, along with a list of failed lite validators if it is not. |
| `license_name` | str | The type of license of the repository. |
| `install_config` | str | Installation configuration for setting up the repository. |
| `docker_image` | str | Docker image name for the instance. |
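Because `FAIL_TO_PASS`, `PASS_TO_PASS`, and `meta` are stored as JSON strings, they need to be decoded before use. The sketch below shows one way to do that, plus a check that treats a task as resolved when every test in both lists passes. That criterion follows the usual SWE-bench convention and is an assumption here, not something this card states. The helper names, the sample test IDs, and the `is_lite` key are hypothetical.

```python
import json


def decode_task(task: dict) -> dict:
    """Return a copy of the row with its JSON-encoded fields parsed."""
    decoded = dict(task)
    for field in ("FAIL_TO_PASS", "PASS_TO_PASS", "meta"):
        decoded[field] = json.loads(task[field])
    return decoded


def is_resolved(task: dict, passed_tests: set[str]) -> bool:
    """True if every FAIL_TO_PASS and PASS_TO_PASS test is in `passed_tests`.

    Assumes the SWE-bench-style criterion; `task` must already be decoded.
    """
    required = set(task["FAIL_TO_PASS"]) | set(task["PASS_TO_PASS"])
    return required <= passed_tests


# Hypothetical row with made-up test IDs and a made-up `meta` key.
raw_task = {
    "FAIL_TO_PASS": '["tests/test_a.py::test_fix"]',
    "PASS_TO_PASS": '["tests/test_a.py::test_old"]',
    "meta": '{"is_lite": true}',
}
task = decode_task(raw_task)
passed = {"tests/test_a.py::test_fix", "tests/test_a.py::test_old"}
print(is_resolved(task, passed))
```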
To execute tasks from SWE-rebench (i.e., set up their environments, apply patches, and run tests), we provide a [fork](https://github.com/SWE-rebench/SWE-bench-fork) of the original SWE-bench execution framework, adapted for our dataset's structure and features.
The primary modification introduces functionality to source environment installation constants directly from the `install_config` field present in each task instance within SWE-rebench. This allows for more flexible and task-specific environment setups.
You can find the details of this modification in the [following commit](https://github.com/SWE-rebench/SWE-bench-fork/commit/980d0cca8aa4e73f1d9f894e906370bef8c4de8a).
To build the necessary Docker images and run agents on SWE-rebench tasks, you have two main options:
1. **Use our SWE-bench fork directly:** Clone the fork and utilize its scripts for building images and executing tasks. The framework will automatically use the `install_config` from each task.
2. **Integrate similar functionality into your existing codebase:** If you have your own execution framework based on SWE-bench or a different system, you can adapt it by implementing a similar mechanism to parse and utilize the `install_config` field from the SWE-rebench task instances. The aforementioned commit can serve as a reference for this integration.
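For the second option, the core idea is to let each task's `install_config` override whatever installation constants a framework would otherwise look up. The sketch below is not the fork's implementation. It assumes `install_config` is either a JSON-encoded string or an already parsed dictionary, and the helper name, the default values, and the keys `python`, `install`, and `test_cmd` are all hypothetical placeholders.

```python
import json

# Hypothetical fallback constants; a real framework would define its own.
DEFAULT_SPECS = {
    "python": "3.9",
    "install": "pip install -e .",
    "test_cmd": "pytest -rA",
}


def resolve_install_specs(instance: dict, defaults: dict = DEFAULT_SPECS) -> dict:
    """Merge a task's `install_config` over framework defaults (sketch)."""
    raw = instance.get("install_config")
    if not raw:
        return dict(defaults)
    # Assumption: string values are JSON-encoded; dicts are used as-is.
    config = json.loads(raw) if isinstance(raw, str) else dict(raw)
    # Skip null entries so the corresponding default is kept.
    overrides = {key: value for key, value in config.items() if value is not None}
    return {**defaults, **overrides}


# Hypothetical task instance with a per-task Python version.
instance = {"install_config": '{"python": "3.11", "install": null}'}
print(resolve_install_specs(instance))
```

Keeping the defaults as a fallback means instances without an `install_config` still get a usable setup, while per-task values take precedence when present.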
# License
The dataset is licensed under the Creative Commons Attribution 4.0 license. However, please respect the license of each specific repository on which a particular instance is based. To facilitate this, the license of each repository at the time of the commit is provided for every instance.
# Citation
```bibtex
@misc{badertdinov2025swerebenchautomatedpipelinetask,
title={SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents},
author={Ibragim Badertdinov and Alexander Golubev and Maksim Nekrashevich and Anton Shevtsov and Simon Karasik and Andrei Andriushchenko and Maria Trofimova and Daria Litvintseva and Boris Yangel},
year={2025},
eprint={2505.20411},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2505.20411}
}
```
Provider: maas
Created: 2025-10-28



