WorkBench
Source: arXiv | Updated: 2024-05-02 | Indexed: 2024-06-21
Download link:
https://github.com/olly-styles/WorkBench
Overview:
WorkBench is a benchmark dataset for evaluating agents' ability to carry out tasks in a realistic workplace setting. It provides a sandbox environment with five databases, 26 tools, and 690 tasks that represent common business activities such as sending emails and scheduling meetings. Each task has a unique, unambiguous outcome, which enables robust automated evaluation. This evaluation exposes weaknesses in how agents handle common business activities and raises questions about their use in high-stakes workplace settings.
Provided by:
University of Glasgow
Created:
2024-05-02



