AGENTISSUE-BENCH
Source: arXiv (indexed 2025-09-30)
Download link:
https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/
Resource overview:
AGENTISSUE-BENCH is a reproducible benchmark of 50 agent issue resolution tasks. It is designed to evaluate how effectively state-of-the-art software engineering (SE) agents resolve real-world issues in agent systems. Each task comes with an executable environment and failure-triggering tests, which supports a comprehensive assessment of SE agents' capabilities.



