AGENTISSUE-BENCH

Source: arXiv, indexed 2025-09-30

Download link:

https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/

Overview:

AgentIssue-Bench is a reproducible benchmark of 50 agent issue resolution tasks, designed to evaluate how effectively state-of-the-art software engineering (SE) agents resolve real-world issues in agent systems. Each task provides an executable environment and failure-triggering tests, which support a thorough assessment of SE agents' capabilities.
