Claw-Eval
Source: ModelScope community. Updated 2026-05-17; indexed 2026-05-10.
Download link: https://modelscope.cn/datasets/claw-eval/Claw-Eval
<div align="center">
<h1>Claw-Eval</h1>
<img src="assets/claw_eval.png" alt="Claw-Eval Logo" width="200">
**End-to-end transparent benchmark for AI agents acting in the real world.**
[Paper](https://huggingface.co/papers/2604.06132) | [Leaderboard](https://claw-eval.github.io) | [Code](https://github.com/claw-eval/claw-eval)
</div>
---
## Dataset Structure
### Splits
| Split | Examples | Description |
|---|---:|---|
| `general` | 161 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| `multimodal` | 101 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
| `multi_turn` | 38 | Multi-turn conversational tasks where the agent interacts with a simulated user persona to clarify needs and provide advice |
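As a quick sanity check after loading, the split sizes in the table above can be compared against what was actually downloaded. The sketch below is illustrative only: `EXPECTED_SIZES` and `check_split_sizes` are hypothetical names, not part of the Claw-Eval codebase, and the function accepts any mapping from split name to a sized collection (such as the `DatasetDict` returned by `load_dataset`).

```python
from typing import Mapping, Sized

# Split sizes copied from the table above (300 examples in total).
EXPECTED_SIZES = {"general": 161, "multimodal": 101, "multi_turn": 38}


def check_split_sizes(splits: Mapping[str, Sized]) -> dict[str, tuple[int, int]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    A missing split is reported with an actual size of 0.
    """
    mismatches = {}
    for name, expected in EXPECTED_SIZES.items():
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

An empty result means every split matches the documented size; a non-empty result may indicate a partial download or a newer dataset revision.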
### Fields
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task identifier |
| `query` | string | Task instruction / description |
| `fixture` | list[string] | Fixture files required for the task (available in `data/fixtures.tar.gz`) |
| `language` | string | Task language (`en` or `zh`) |
| `category` | string | Task domain |
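The `fixture` field lists files that ship in `data/fixtures.tar.gz`. A minimal sketch of checking which of a task's fixtures are present in a local copy of that archive follows. The helper name `missing_fixtures` is hypothetical, and the matching rule is an assumption: the internal layout of the archive is not described here, so the sketch accepts a member whose path equals the fixture entry or ends with it on a path-component boundary.

```python
import tarfile
from pathlib import PurePosixPath


def missing_fixtures(task: dict, archive_path: str) -> list[str]:
    """Return the entries of task["fixture"] not found in the archive.

    Assumption: a fixture entry matches an archive member if the member path
    equals the entry or ends with it on a path-component boundary.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        members = [PurePosixPath(m.name) for m in archive.getmembers() if m.isfile()]

    def present(entry: str) -> bool:
        parts = PurePosixPath(entry).parts
        return any(m.parts[-len(parts):] == parts for m in members)

    return [entry for entry in task.get("fixture", []) if not present(entry)]
```

If the real archive uses a different layout, only the `present` helper should need adjusting.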
## Usage
```python
from datasets import load_dataset
# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")
# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")
multi_turn = load_dataset("claw-eval/Claw-Eval", split="multi_turn")
# Inspect a sample
print(general[0])
```
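Once a split is loaded, the `language` and `category` fields make it straightforward to summarize or subset the tasks. The sketch below uses only the standard library and works on any iterable of records shaped like the Fields table, including a loaded split. `summarize` and `select` are hypothetical helper names, not part of the Claw-Eval codebase.

```python
from collections import Counter
from typing import Iterable, Optional


def summarize(tasks: Iterable[dict]) -> Counter:
    """Count tasks per (language, category) pair."""
    return Counter((t["language"], t["category"]) for t in tasks)


def select(tasks: Iterable[dict], language: Optional[str] = None,
           category: Optional[str] = None) -> list[dict]:
    """Keep tasks matching the given language and/or category (None = any)."""
    return [
        t for t in tasks
        if (language is None or t["language"] == language)
        and (category is None or t["category"] == category)
    ]
```

For instance, `select(general, language="zh")` would return the Chinese-language tasks of the `general` split as plain dictionaries.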
## Acknowledgements
Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
## Citation
If you use Claw-Eval in your research, please cite:
```bibtex
@article{ye2026claw,
title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
journal={arXiv preprint arXiv:2604.06132},
year={2026}
}
```
## Core Contributors
[Bowen Ye](https://github.com/pkuYmiracle) (PKU), [Rang Li](https://github.com/lirang04) (PKU), [Qibin Yang](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Yuanxin Liu](https://llyx97.github.io/) (PKU), [Linli Yao](https://yaolinli.github.io/) (PKU), [Hanglong Lyu](https://github.com/Albus2002) (PKU), [Lei Li](https://lilei-nlp.github.io) (HKU, project lead)
## Advisors
[Tong Yang](https://yangtonghome.github.io/) (PKU), [Zhifang Sui](https://cs.pku.edu.cn/info/1226/2014.htm) (PKU), [Lingpeng Kong](https://ikekonglp.github.io/) (HKU), [Qi Liu](https://leuchine.github.io/) (HKU)
## Contribution
We welcome contributions of any kind. Let us know if you have suggestions!
## License
This dataset is released under the [MIT License](https://github.com/claw-eval/claw-eval/blob/main/LICENSE).
Provider: maas
Created: 2026-04-28



