claw-eval/Claw-Eval
Source: Hugging Face. Updated 2026-03-25, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/claw-eval/Claw-Eval
Description:
---
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: query
    dtype: string
  - name: fixture
    list: string
  - name: language
    dtype: string
  - name: category
    dtype: string
  - name: rubric
    dtype: large_string
  splits:
  - name: general
    num_bytes: 200118
    num_examples: 104
  - name: multimodal
    num_bytes: 72393
    num_examples: 35
  download_size: 155773
  dataset_size: 272511
configs:
- config_name: default
  data_files:
  - split: general
    path: data/general-*
  - split: multimodal
    path: data/multimodal-*
language:
- en
- zh
license: mit
tags:
- agent-bench
- evaluation
- real-world
- multimodal
pretty_name: Claw-Eval
size_categories:
- n<1K
---
<div align="center">
<h1>Claw-Eval</h1>
<img src="assets/claw_eval.png" alt="Claw-Eval Logo" width="200">
**End-to-end transparent benchmark for AI agents acting in the real world.**
[Leaderboard](https://claw-eval.github.io) | [Code](https://github.com/claw-eval/claw-eval)
</div>
---
## Dataset Structure
### Splits
| Split | Examples | Description |
|---|---:|---|
| `general` | 104 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| `multimodal` | 35 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |
### Fields
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task identifier |
| `query` | string | Task instruction / description |
| `fixture` | list[string] | Fixture files required for the task (available in `data/fixtures.tar.gz`) |
| `language` | string | Task language (`en` or `zh`) |
| `category` | string | Task domain |
| `rubric` | large_string | Detailed evaluation criteria with weighted scoring |
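
The `fixture` field only lists file names, so the archive has to be unpacked before a task can be run. The following is a minimal sketch of that step, not the project's own tooling. The helper names `extract_fixtures` and `resolve_fixtures` are hypothetical, and the sketch assumes that each `fixture` entry is a path relative to the root of the unpacked archive, which the card does not state.

```python
import tarfile
from pathlib import Path


def extract_fixtures(archive_path, dest_dir):
    """Unpack a .tar.gz fixtures archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Where the interpreter supports extraction filters, use the "data"
        # filter, which rejects members that would land outside dest.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


def resolve_fixtures(row, fixtures_root):
    """Map each entry of row["fixture"] to a path under fixtures_root.

    Assumption (not confirmed by the card): entries are relative paths
    inside the unpacked archive.
    """
    root = Path(fixtures_root)
    return [root / name for name in row["fixture"]]
```

A local copy of the archive could be obtained with `huggingface_hub.hf_hub_download(repo_id="claw-eval/Claw-Eval", filename="data/fixtures.tar.gz", repo_type="dataset")`, assuming the file sits at the path given in the table; the returned path can then be passed to `extract_fixtures`.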
## Usage
```python
from datasets import load_dataset

# Load all splits (returns a DatasetDict keyed by split name)
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")

# Inspect a sample
print(general[0])
```
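Once a split is loaded, a common next step is to select tasks by `language` or `category`. The sketch below shows one way to do that with plain Python over row dictionaries, so it runs without downloading anything. The helpers `filter_rows` and `category_counts` are hypothetical, and the rows in `sample_rows` are made-up placeholders that only mirror the field names from the table above; the real task IDs and category strings may differ.

```python
from collections import Counter


def filter_rows(rows, language=None, category=None):
    """Return the rows that match the given language and/or category."""
    return [
        row
        for row in rows
        if (language is None or row["language"] == language)
        and (category is None or row["category"] == category)
    ]


def category_counts(rows):
    """Count how many rows fall into each category."""
    return Counter(row["category"] for row in rows)


# Placeholder rows for illustration only; these are not real dataset records.
sample_rows = [
    {"task_id": "demo-001", "query": "...", "fixture": [],
     "language": "en", "category": "finance", "rubric": "..."},
    {"task_id": "demo-002", "query": "...", "fixture": [],
     "language": "zh", "category": "finance", "rubric": "..."},
    {"task_id": "demo-003", "query": "...", "fixture": [],
     "language": "en", "category": "communication", "rubric": "..."},
]

zh_tasks = filter_rows(sample_rows, language="zh")
print([row["task_id"] for row in zh_tasks])
print(category_counts(sample_rows))
```

Iterating over a loaded split yields one dictionary per row, so the same helpers should work when `general` or `multimodal` is passed in place of `sample_rows`.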
## Acknowledgements
Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
## Citation
If you use Claw-Eval in your research, please cite:
```bibtex
@misc{claw-eval2026,
title={Claw-Eval: End-to-End Transparent Benchmark for AI Agents in the Real World},
author={Ye, Bowen and Li, Rang and Yang, Qibin and Xie, Zhihui and Li, Lei},
year={2026},
url={https://github.com/claw-eval/claw-eval}
}
```
## Contributors
[Bowen Ye*](https://github.com/pkuYmiracle) (PKU), [Rang Li*](https://github.com/lirang04) (PKU), [Qibin Yang*](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Lei Li](https://lilei-nlp.github.io)<sup>†</sup> (HKU, Project Lead)
Advisors: [Tong Yang](https://yangtonghome.github.io/) (PKU), [Zhifang Sui](https://cs.pku.edu.cn/info/1226/2014.htm) (PKU), [Lingpeng Kong](https://ikekonglp.github.io/) (HKU), [Qi Liu](https://leuchine.github.io/) (HKU)
## Contribution
We welcome any kind of contribution. Let us know if you have any suggestions!
## License
This dataset is released under the [MIT License](https://github.com/claw-eval/claw-eval/blob/main/LICENSE).
Provider: claw-eval



