
Claw-Eval

ModelScope community · Updated 2026-05-17 · Indexed 2026-05-10
Download link:
https://modelscope.cn/datasets/claw-eval/Claw-Eval
Resource description:
<div align="center">

<h1>Claw-Eval</h1>

<img src="assets/claw_eval.png" alt="Claw-Eval Logo" width="200">

[![Tasks](https://img.shields.io/badge/tasks-300-blue)](#dataset-structure)
[![Leaderboard](https://img.shields.io/badge/leaderboard-live-purple)](https://claw-eval.github.io)
[![License](https://img.shields.io/badge/license-MIT-orange)](https://github.com/claw-eval/claw-eval/blob/main/LICENSE)

**End-to-end transparent benchmark for AI agents acting in the real world.**

[Paper](https://huggingface.co/papers/2604.06132) | [Leaderboard](https://claw-eval.github.io) | [Code](https://github.com/claw-eval/claw-eval)

</div>

---

## Dataset Structure

### Splits

| Split | Examples | Description |
|---|---:|---|
| `general` | 161 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| `multimodal` | 101 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
| `multi_turn` | 38 | Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice |

### Fields

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task identifier |
| `query` | string | Task instruction / description |
| `fixture` | list[string] | Fixture files required for the task (available in `data/fixtures.tar.gz`) |
| `language` | string | Task language (`en` or `zh`) |
| `category` | string | Task domain |

## Usage

```python
from datasets import load_dataset

# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")

# Inspect a sample
print(general[0])
```

## Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

## Citation

If you use Claw-Eval in your research, please cite:

```bibtex
@article{ye2026claw,
  title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
  author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
  journal={arXiv preprint arXiv:2604.06132},
  year={2026}
}
```

## Core Contributors

[Bowen Ye](https://github.com/pkuYmiracle) (PKU), [Rang Li](https://github.com/lirang04) (PKU), [Qibin Yang](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Yuanxin Liu](https://llyx97.github.io/) (PKU), [Linli Yao](https://yaolinli.github.io/) (PKU), [Hanglong Lyu](https://github.com/Albus2002) (PKU), [Lei Li](https://lilei-nlp.github.io) (HKU, project lead)

## Advisors

[Tong Yang](https://yangtonghome.github.io/) (PKU), [Zhifang Sui](https://cs.pku.edu.cn/info/1226/2014.htm) (PKU), [Lingpeng Kong](https://ikekonglp.github.io/) (HKU), [Qi Liu](https://leuchine.github.io/) (HKU)

## Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

## License

This dataset is released under the [MIT License](https://github.com/claw-eval/claw-eval/blob/main/LICENSE).
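
## Example: Summarizing Tasks by Field

As a supplement to the Fields table under Dataset Structure, the following is a minimal sketch of how rows with that schema might be summarized before running an evaluation. The function name `summarize_tasks` and the three sample rows are hypothetical and are not taken from the dataset; only the field names (`task_id`, `query`, `fixture`, `language`, `category`) come from the table.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def summarize_tasks(rows: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count tasks per category and language, and list the distinct fixtures."""
    rows = list(rows)  # allow a single pass over generators as well as lists
    return {
        "total": len(rows),
        "by_category": dict(Counter(row["category"] for row in rows)),
        "by_language": dict(Counter(row["language"] for row in rows)),
        # Each row's `fixture` field is a list of file names.
        "fixtures": sorted({name for row in rows for name in row["fixture"]}),
    }


# Hypothetical rows that follow the documented schema (not real dataset entries).
sample_rows = [
    {"task_id": "demo-001", "query": "Summarize the report.",
     "fixture": ["report.pdf"], "language": "en", "category": "finance"},
    {"task_id": "demo-002", "query": "Draft a reply email.",
     "fixture": [], "language": "en", "category": "communication"},
    {"task_id": "demo-003", "query": "整理会议纪要。",
     "fixture": ["notes.txt", "report.pdf"], "language": "zh", "category": "finance"},
]

summary = summarize_tasks(sample_rows)
print(summary)
```

Because iterating over a split loaded with `load_dataset` yields one dictionary per row, the same function should also accept a loaded split such as `general` directly, assuming the columns match the Fields table.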

Provider:
maas
Created:
2026-04-28