Claw-Eval

Name: Claw-Eval
Creator: maas
Published: 2026-05-17 02:04:26
License: MIT (as stated in the dataset card below)

ModelScope community; updated 2026-05-17; indexed 2026-05-10

Download link:

https://modelscope.cn/datasets/claw-eval/Claw-Eval


Description:

<div align="center">

<h1>Claw-Eval</h1>

<img src="assets/claw_eval.png" alt="Claw-Eval Logo" width="200">

[![Tasks](https://img.shields.io/badge/tasks-300-blue)](#dataset-structure)
[![Leaderboard](https://img.shields.io/badge/leaderboard-live-purple)](https://claw-eval.github.io)
[![License](https://img.shields.io/badge/license-MIT-orange)](https://github.com/claw-eval/claw-eval/blob/main/LICENSE)

**End-to-end transparent benchmark for AI agents acting in the real world.**

[Paper](https://huggingface.co/papers/2604.06132) | [Leaderboard](https://claw-eval.github.io) | [Code](https://github.com/claw-eval/claw-eval)

</div>

---

## Dataset Structure

### Splits

| Split | Examples | Description |
|---|---:|---|
| `general` | 161 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| `multimodal` | 101 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
| `multi_turn` | 38 | Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice |

### Fields

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task identifier |
| `query` | string | Task instruction / description |
| `fixture` | list[string] | Fixture files required for the task (available in `data/fixtures.tar.gz`) |
| `language` | string | Task language (`en` or `zh`) |
| `category` | string | Task domain |

## Usage

```python
from datasets import load_dataset

# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")

# Inspect a sample
print(general[0])
```

## Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

## Citation

If you use Claw-Eval in your research, please cite:

```bibtex
@article{ye2026claw,
  title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
  author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
  journal={arXiv preprint arXiv:2604.06132},
  year={2026}
}
```

## Core Contributors

[Bowen Ye](https://github.com/pkuYmiracle) (PKU), [Rang Li](https://github.com/lirang04) (PKU), [Qibin Yang](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Yuanxin Liu](https://llyx97.github.io/) (PKU), [Linli Yao](https://yaolinli.github.io/) (PKU), [Hanglong Lyu](https://github.com/Albus2002) (PKU), [Lei Li](https://lilei-nlp.github.io) (HKU, project lead)

## Advisors

[Tong Yang](https://yangtonghome.github.io/) (PKU), [Zhifang Sui](https://cs.pku.edu.cn/info/1226/2014.htm) (PKU), [Lingpeng Kong](https://ikekonglp.github.io/) (HKU), [Qi Liu](https://leuchine.github.io/) (HKU)

## Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

## License

This dataset is released under the [MIT License](https://github.com/claw-eval/claw-eval/blob/main/LICENSE).
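
## Example: checking records against the documented schema

The Splits and Fields tables above describe the shape of the data, and that description can be exercised offline. The sketch below is a minimal illustration under stated assumptions and is not part of the Claw-Eval codebase: `validate_record` and `count_by` are hypothetical helper names, and `SAMPLE_RECORDS` contains invented records (made-up task IDs, queries, and fixture names) that only mimic the documented fields. The split sizes and field types are copied from the tables.

```python
from collections import Counter

# Split sizes as listed in the Splits table of this card.
SPLIT_SIZES = {"general": 161, "multimodal": 101, "multi_turn": 38}

# Field names and the Python types they are assumed to map to,
# following the Fields table (string -> str, list[string] -> list).
EXPECTED_FIELDS = {
    "task_id": str,
    "query": str,
    "fixture": list,
    "language": str,
    "category": str,
}


def validate_record(record):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}")
    # The card documents only two language codes.
    if record.get("language") not in ("en", "zh"):
        problems.append("language must be 'en' or 'zh'")
    # Every fixture entry should be a file name string.
    fixture = record.get("fixture")
    if isinstance(fixture, list) and not all(isinstance(p, str) for p in fixture):
        problems.append("fixture entries must be strings")
    return problems


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


# Hypothetical records, invented for illustration only.
SAMPLE_RECORDS = [
    {"task_id": "demo-001", "query": "Summarize the inbox.",
     "fixture": ["inbox.json"], "language": "en", "category": "communication"},
    {"task_id": "demo-002", "query": "Reconcile the ledger.",
     "fixture": [], "language": "en", "category": "finance"},
    {"task_id": "demo-003", "query": "整理本周会议纪要。",
     "fixture": ["notes.md"], "language": "zh", "category": "productivity"},
]

if __name__ == "__main__":
    print("total tasks:", sum(SPLIT_SIZES.values()))
    print("languages:", dict(count_by(SAMPLE_RECORDS, "language")))
    for record in SAMPLE_RECORDS:
        print(record["task_id"], validate_record(record))
```

The three split sizes sum to 300, which matches the task count shown in the badge. The same helpers should also accept rows from a split loaded as in the Usage section, since iterating a `datasets.Dataset` yields one dictionary per row, but that has not been verified against the published data here.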


Provider: maas

Created: 2026-04-28
