SWE-bench_Verified

Name: SWE-bench_Verified
Creator: maas
Published: 2026-05-15 17:36:53
License: No description available

ModelScope Community. Updated: 2026-05-15. Indexed: 2024-08-31.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/SWE-bench_Verified

Resource description:

**Dataset Summary**

SWE-bench Verified is a subset of 500 samples from the SWE-bench test set that have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.

The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification, using post-PR behavior as the reference solution.

The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

**Want to run inference now?**

This dataset contains only the problem_statement (i.e. the issue text) and the base_commit, which represents the state of the codebase before the issue was resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets:

princeton-nlp/SWE-bench_Lite_oracle
princeton-nlp/SWE-bench_Lite_bm25_13K
princeton-nlp/SWE-bench_Lite_bm25_27K

**Supported Tasks and Leaderboards**

SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com

**Languages**

The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.

**Dataset Structure**

An example of a SWE-bench datum is as follows:

```
instance_id: (str) - A formatted instance identifier, usually as repo_owner__repo_name-PR-number.
patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
repo: (str) - The repository owner/name identifier from GitHub.
base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
hints_text: (str) - Comments made on the issue before the creation date of the solution PR’s first commit.
created_at: (str) - The creation date of the pull request.
test_patch: (str) - A test-file patch that was contributed by the solution PR.
problem_statement: (str) - The issue title and body.
version: (str) - Installation version to use for running evaluation.
environment_setup_commit: (str) - Commit hash to use for environment setup and installation.
FAIL_TO_PASS: (str) - A JSON list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
PASS_TO_PASS: (str) - A JSON list of strings that represent tests that should pass before and after the PR application.
```
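Because FAIL_TO_PASS and PASS_TO_PASS are stored as JSON-encoded strings rather than native lists, they need to be decoded before use. The following is a minimal sketch of that step, together with one plausible reading of the unit-test evaluation described above: an instance counts as resolved only if every test in both lists passes after a candidate patch is applied. The record values, the helper names `decode_tests` and `is_resolved`, and the exact resolution criterion are assumptions made for illustration; they are not taken from the official evaluation harness.

```python
import json

# Hypothetical record shaped like the fields listed above.
# All values are invented for illustration and do not come from the dataset.
record = {
    "instance_id": "example_owner__example_repo-123",
    "repo": "example_owner/example_repo",
    "FAIL_TO_PASS": '["tests/test_api.py::test_fix"]',
    "PASS_TO_PASS": '["tests/test_api.py::test_existing", '
                    '"tests/test_util.py::test_helper"]',
}


def decode_tests(record):
    """Decode the two JSON-encoded test lists into Python lists of test names."""
    fail_to_pass = json.loads(record["FAIL_TO_PASS"])
    pass_to_pass = json.loads(record["PASS_TO_PASS"])
    return fail_to_pass, pass_to_pass


def is_resolved(record, passed_tests):
    """Assumed criterion: every FAIL_TO_PASS and PASS_TO_PASS test must be
    among the tests that passed after the candidate patch was applied."""
    fail_to_pass, pass_to_pass = decode_tests(record)
    passed = set(passed_tests)
    return all(test in passed for test in fail_to_pass + pass_to_pass)
```

Here `passed_tests` stands for whatever list of passing test names a test runner reports; how that list is produced is outside the scope of this sketch.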

Provider:

maas

Created:

2024-08-14
