
claw-eval/Claw-Eval

Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/claw-eval/Claw-Eval
Resource description:
---
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: query
    dtype: string
  - name: fixture
    list: string
  - name: language
    dtype: string
  - name: category
    dtype: string
  - name: rubric
    dtype: large_string
  splits:
  - name: general
    num_bytes: 200118
    num_examples: 104
  - name: multimodal
    num_bytes: 72393
    num_examples: 35
  download_size: 155773
  dataset_size: 272511
configs:
- config_name: default
  data_files:
  - split: general
    path: data/general-*
  - split: multimodal
    path: data/multimodal-*
language:
- en
- zh
license: mit
tags:
- agent-bench
- evaluation
- real-world
- multimodal
pretty_name: Claw-Eval
size_categories:
- n<1K
---

<div align="center">

<h1>Claw-Eval</h1>

<img src="assets/claw_eval.png" alt="Claw-Eval Logo" width="200">

[![Tasks](https://img.shields.io/badge/tasks-139-blue)](#dataset-structure)
[![Models](https://img.shields.io/badge/models-23-green)](https://claw-eval.github.io)
[![Leaderboard](https://img.shields.io/badge/leaderboard-live-purple)](https://claw-eval.github.io)
[![License](https://img.shields.io/badge/license-MIT-orange)](https://github.com/claw-eval/claw-eval/blob/main/LICENSE)

**End-to-end transparent benchmark for AI agents acting in the real world.**

[Leaderboard](https://claw-eval.github.io) | [Code](https://github.com/claw-eval/claw-eval)

</div>

---

## Dataset Structure

### Splits

| Split | Examples | Description |
|---|---:|---|
| `general` | 104 | Core agent tasks across 24 categories (communication, finance, ops, productivity, etc.) |
| `multimodal` | 35 | Multimodal agentic tasks requiring perception and creation (webpage generation, video QA, document extraction, etc.) |

### Fields

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task identifier |
| `query` | string | Task instruction / description |
| `fixture` | list[string] | Fixture files required for the task (available in `data/fixtures.tar.gz`) |
| `language` | string | Task language (`en` or `zh`) |
| `category` | string | Task domain |
| `rubric` | string | Detailed evaluation criteria with weighted scoring |

## Usage

```python
from datasets import load_dataset

# Load all splits
dataset = load_dataset("claw-eval/Claw-Eval")

# Load a specific split
general = load_dataset("claw-eval/Claw-Eval", split="general")
multimodal = load_dataset("claw-eval/Claw-Eval", split="multimodal")

# Inspect a sample
print(general[0])
```

## Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinchBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.

## Citation

If you use Claw-Eval in your research, please cite:

```bibtex
@misc{claw-eval2026,
  title={Claw-Eval: End-to-End Transparent Benchmark for AI Agents in the Real World},
  author={Ye, Bowen and Li, Rang and Yang, Qibin and Xie, Zhihui and Li, Lei},
  year={2026},
  url={https://github.com/claw-eval/claw-eval}
}
```

## Contributors

[Bowen Ye*](https://github.com/pkuYmiracle) (PKU), [Rang Li*](https://github.com/lirang04) (PKU), [Qibin Yang*](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Lei Li](https://lilei-nlp.github.io)<sup>†</sup> (HKU, Project Lead)

Advisors: [Tong Yang](https://yangtonghome.github.io/) (PKU), [Zhifang Sui](https://cs.pku.edu.cn/info/1226/2014.htm) (PKU), [Lingpeng Kong](https://ikekonglp.github.io/) (HKU), [Qi Liu](https://leuchine.github.io/) (HKU)

## Contribution

We welcome any kind of contribution. Let us know if you have any suggestions!

## License

This dataset is released under the [MIT License](https://github.com/claw-eval/claw-eval/blob/main/LICENSE).
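
## Example: summarizing tasks by field

The fields documented above (`task_id`, `fixture`, `language`, `category`) lend themselves to quick sanity checks before running an agent. The following is a minimal sketch, not part of the official Claw-Eval tooling: the helper names `summarize_tasks` and `tasks_needing_fixtures` are hypothetical, and `sample_rows` contains made-up rows that only mimic the documented schema so the snippet runs offline.

```python
from collections import Counter
from typing import Iterable, List, Mapping


def summarize_tasks(rows: Iterable[Mapping[str, object]]) -> Counter:
    """Count tasks per (category, language) pair."""
    counts: Counter = Counter()
    for row in rows:
        counts[(row["category"], row["language"])] += 1
    return counts


def tasks_needing_fixtures(rows: Iterable[Mapping[str, object]]) -> List[str]:
    """Return the task_id of every row whose `fixture` list is non-empty."""
    return [row["task_id"] for row in rows if row["fixture"]]


# Hypothetical rows following the documented schema; NOT real dataset entries.
sample_rows = [
    {"task_id": "demo-001", "query": "...", "fixture": ["ledger.csv"],
     "language": "en", "category": "finance", "rubric": "..."},
    {"task_id": "demo-002", "query": "...", "fixture": [],
     "language": "en", "category": "finance", "rubric": "..."},
    {"task_id": "demo-003", "query": "...", "fixture": ["inbox.json"],
     "language": "zh", "category": "communication", "rubric": "..."},
]

summary = summarize_tasks(sample_rows)
with_fixtures = tasks_needing_fixtures(sample_rows)
```

Because iterating over a split returned by `load_dataset` yields one dictionary per row, the same helpers should work on real data, for example `summarize_tasks(load_dataset("claw-eval/Claw-Eval", split="general"))`.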
Provided by:
claw-eval