PyBench

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/Mercury7353/PyBench

Overview:

PyBench is a benchmark covering five categories of real-world programming tasks and more than ten file types. Given a high-level user query, an LLM agent must reason about the task and execute Python code through a code interpreter. Beyond basic coding skill, the benchmark tests planning, multi-turn interaction, use of code execution feedback, and the ability to produce a formal final response, so it measures the broader set of abilities that real-world programming tasks demand.
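The interaction pattern described above (propose code, run it in an interpreter, read the feedback, then give a final response) can be illustrated with a minimal sketch. This is not PyBench's evaluation harness: `run_code`, `scripted_agent`, and `solve` are hypothetical names, and the scripted agent is a stand-in for a real LLM so that the loop runs end to end.

```python
import contextlib
import io
import traceback


def run_code(code, namespace):
    """Execute code in a shared namespace; return (ok, stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        return True, buffer.getvalue()
    except Exception:
        return False, buffer.getvalue() + traceback.format_exc()


def scripted_agent(query, history):
    """Hypothetical stand-in for an LLM: returns ('code', src) or ('answer', text)."""
    if not history:
        # First attempt contains a deliberate bug (misspelled, undefined name).
        return "code", "print(sum(valeus))"
    last_ok, last_output = history[-1]
    if not last_ok:
        # Use the error feedback to issue a corrected attempt.
        return "code", "values = [1, 2, 3]\nprint(sum(values))"
    # The last run succeeded, so turn its output into a final response.
    return "answer", f"The total is {last_output.strip()}."


def solve(query, agent, max_turns=5):
    """Multi-turn loop: feed each execution result back to the agent."""
    namespace = {}  # state persists across turns, like a notebook session
    history = []
    for _ in range(max_turns):
        kind, content = agent(query, history)
        if kind == "answer":
            return content, history
        history.append(run_code(content, namespace))
    return None, history


answer, history = solve("Sum the numbers 1, 2 and 3.", scripted_agent)
print(answer)
```

Here the first turn fails with a `NameError`, the second turn succeeds, and the third turn produces the final response. A real setup would replace `scripted_agent` with a model call and run the code in a sandbox rather than with a bare `exec`.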
