
WorkBench

Source: arXiv. Updated: 2024-05-02. Indexed: 2024-06-21.
Download link:
https://github.com/olly-styles/WorkBench
Description:
WorkBench is a benchmark dataset for evaluating how well agents carry out tasks in a realistic workplace setting. It provides a sandbox environment with five databases, 26 tools, and 690 tasks that represent common business activities such as sending emails and scheduling meetings. Each task has a unique, unambiguous outcome, which enables robust automated evaluation. The benchmark exposes agents' weaknesses on these common business activities and raises questions about their use in high-stakes workplace settings.
Provided by:
University of Glasgow
Created:
2024-05-02