ZClawBench
Source: ModelScope community. Updated 2026-05-16; indexed 2026-03-29.
Download link:
https://modelscope.cn/datasets/ZhipuAI/ZClawBench
Resource description:
# ZClawBench
## Overview
Recent advances in agent frameworks have pushed large language models beyond conversational assistance toward **goal-driven task execution**. Among these frameworks, OpenClaw has emerged as a representative setting for evaluating whether models can interact with tools, follow multi-step instructions, and complete practical tasks in realistic environments. Unlike traditional chatbot benchmarks, OpenClaw-style scenarios require models not only to produce plausible responses, but also to **take actions, coordinate external capabilities, and reliably finish end-to-end workflows**.
At the same time, real-world OpenClaw usage has rapidly expanded from technical tasks such as installation, configuration, and coding to broader productivity scenarios including office automation, information collection, data analysis, and content creation. This shift makes it increasingly important to evaluate models on **real user demands**, rather than on narrowly defined synthetic tasks or static single-turn question answering benchmarks.
Motivated by this gap, we build **ZClawBench**, a benchmark designed to reflect both **current high-frequency OpenClaw scenarios** and **fast-growing agent capabilities**. By grounding the benchmark in real user needs and evaluating models in an end-to-end agentic setting, ZClawBench aims to provide a more realistic measure of how well models can perform as general-purpose agents in practical OpenClaw workflows.
We are currently preparing the code for open-source release, including decoupling it from our internal framework, and expect to release it soon.
## Data Construction and Disclaimer
All examples in ZClawBench are either written manually or synthesized automatically. No online production data or real-world user data is used in the creation of the benchmark.
To further avoid privacy or attribution concerns, all company names, organization names, and personal names appearing in the benchmark are artificially synthesized fictional entities. They are not intended to refer to any real-world subjects, and any overlap with existing entities or individuals is purely coincidental.
## Task Distributions
The current version of ZClawBench contains 116 test cases, distributed as follows:
| Scenario | Task Count | Percentage |
| --- | ---: | ---: |
| Overall | 116 | 100.0% |
| Information Search & Gathering | 22 | 19.0% |
| Office & Daily Tasks | 35 | 30.2% |
| Data Analysis | 10 | 8.6% |
| Development & Operations | 19 | 16.4% |
| Automation | 20 | 17.2% |
| Security | 10 | 8.6% |
Among them, **35 out of 116 test cases require the agent to use skills**.
## Evaluation Methodology
A key challenge in evaluating OpenClaw-style agent tasks is that **different task types require different evaluation methods**.
For example, deployment-oriented tasks mainly care about whether the environment is configured correctly and can run successfully, while report-generation tasks focus much more on the quality, completeness, and usefulness of the generated content. Because of this heterogeneity, a single evaluation strategy is insufficient for OpenClaw scenarios.
To address this, we designed a **three-level evaluation framework**:
### 1. Script-based Verification
For tasks whose outcomes can be checked through explicit rules, assertions, or executable programs, we adopt **script-based verification**.
Examples include generating a file with a specified name in a target directory, passing unit tests, or producing outputs that satisfy deterministic constraints. These tasks can be evaluated objectively and reproducibly through scripts.
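As an illustration of what a script-based check might look like, the sketch below evaluates a small checklist against an agent's output directory. The file names, the required heading, and the helper functions are all hypothetical and are not taken from the ZClawBench harness, which has not yet been released.

```python
import tempfile
from pathlib import Path


def file_exists(target_dir: Path, name: str) -> bool:
    """Checklist item: a file with the given name exists in target_dir."""
    return (target_dir / name).is_file()


def file_contains(target_dir: Path, name: str, text: str) -> bool:
    """Checklist item: the file exists and contains the required text."""
    path = target_dir / name
    return path.is_file() and text in path.read_text(encoding="utf-8")


def run_checklist(target_dir: Path) -> list[bool]:
    """Run every (hypothetical) checklist item for one test case."""
    return [
        file_exists(target_dir, "report.md"),
        file_contains(target_dir, "report.md", "# Summary"),
        file_exists(target_dir, "data.csv"),
    ]


# Simulate an agent workspace in which only report.md was produced.
with tempfile.TemporaryDirectory() as workspace:
    root = Path(workspace)
    (root / "report.md").write_text("# Summary\nDone.\n", encoding="utf-8")
    results = run_checklist(root)

print(results)  # [True, True, False]
```

Each boolean corresponds to one checklist item, so a case like this one would pass two of its three items.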
### 2. Agentic Point-wise Verification
For tasks that cannot be reliably assessed with simple rules, but can still be judged against a set of requirements, we use **agentic point-wise verification**.
Examples include configuring an agent with a particular persona or behavior style, collecting specific information from the past seven days, or implementing a small interactive game with required functionalities. In such cases, an **agentic judge** determines whether the instruction has been satisfactorily fulfilled.
### 3. Agentic Pair-wise Evaluation
For tasks where content quality is the primary concern, we adopt **agentic pair-wise evaluation**.
Examples include generating a report on a topic over a given time period, where evaluation depends not only on whether the task is completed, but also on the relative quality of the output. For such open-ended tasks, which are highly diverse and admit multiple valid solution paths, we find pair-wise comparison to be more reliable than absolute scoring.
Specifically, the evaluated output is compared against a fixed baseline response, and the result is scored as 1 / 0.5 / 0 according to the win / tie / loss outcome.
### Scoring
For tasks evaluated with the first two methods (script-based and agentic point-wise verification), each test case may contain **one or more checklist items**. The score of an individual test case is defined as the **pass rate over its checklist items**. The overall benchmark score is then computed as the **average score across all test cases**.
Formally, let the benchmark contain $N$ test cases, where test case $i$ has $M_i \geq 1$ checklist items. The value $c_{ij}$ of checklist item $j$ in test case $i$ is 1 if the item is passed and 0 otherwise. The score of test case $i$ is the average of its checklist item values:
$$
s_i = \frac{1}{M_i} \sum_{j=1}^{M_i} c_{ij}
$$
and the final benchmark score is:
$$
\mathrm{Score}_{\text{bench}} = \frac{1}{N} \sum_{i=1}^{N} s_i
$$
For pair-wise evaluation tasks, the per-case score $s_i$ is directly assigned as **1**, **0.5**, or **0** based on the **win / tie / loss** outcome, and enters the final benchmark score through the same average.
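The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration under the definitions in this section, not the benchmark's own implementation; the function names and the sample checklist values are made up for the example.

```python
# Per-case score for pair-wise tasks, keyed by comparison outcome.
PAIRWISE_SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def checklist_score(items: list[int]) -> float:
    """Pass rate over the checklist items (each 1 or 0) of one test case."""
    if not items:
        raise ValueError("a test case needs at least one checklist item")
    return sum(items) / len(items)


def pairwise_score(outcome: str) -> float:
    """Score of a pair-wise test case against the fixed baseline."""
    return PAIRWISE_SCORES[outcome]


def benchmark_score(case_scores: list[float]) -> float:
    """Unweighted mean of the per-case scores."""
    return sum(case_scores) / len(case_scores)


# Four hypothetical test cases: two checklist-based, two pair-wise.
case_scores = [
    checklist_score([1, 1, 0, 1]),  # 0.75
    checklist_score([1]),           # 1.0
    pairwise_score("tie"),          # 0.5
    pairwise_score("loss"),         # 0.0
]
score = benchmark_score(case_scores)
print(score)  # 0.5625
```

Because every test case contributes one value in [0, 1] regardless of how many checklist items it has, a case with many items does not weigh more than a case with one.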
### Evaluation Environment
Both the generation framework and the agentic judge framework are deployed in **fixed and isolated Docker environments** to minimize interference and ensure reproducibility.
To simulate certain complex environments—such as databases or email sending systems—while reducing unnecessary uncertainty, we apply **mock interfaces** for some tasks. As long as the agent invokes the interface with the correct parameters, the system returns a fixed response. This design preserves the realism of tool use while avoiding noise introduced by unstable external services or environments.
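A mock interface of this kind might look like the following sketch for an email-sending tool. The class name, parameter names, and response payloads are assumptions made for illustration. In particular, the error response for incorrect parameters is a guess, since the text above only describes the behavior for correct calls.

```python
from dataclasses import dataclass, field

# Hypothetical fixed payloads; the real mock responses are not published.
FIXED_RESPONSE = {"status": "ok", "message_id": "mock-0001"}
ERROR_RESPONSE = {"status": "error", "reason": "invalid parameters"}


@dataclass
class MockEmailService:
    """Stands in for a real mail system and records every call it receives."""

    expected_to: str
    expected_subject: str
    calls: list[dict] = field(default_factory=list)

    def send_email(self, to: str, subject: str, body: str) -> dict:
        # Log the call so a verifier can inspect it afterwards.
        self.calls.append({"to": to, "subject": subject, "body": body})
        if to == self.expected_to and subject == self.expected_subject:
            return dict(FIXED_RESPONSE)
        return dict(ERROR_RESPONSE)


service = MockEmailService("user@example.com", "Weekly report")
good = service.send_email("user@example.com", "Weekly report", "See attached.")
bad = service.send_email("other@example.com", "Weekly report", "See attached.")
print(good["status"], bad["status"])  # ok error
```

Recording the calls keeps the check deterministic: a verifier can inspect `service.calls` rather than depend on a live mail server.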
## Detailed Results
The radar chart below provides a compact comparison of model performance across the main ZClawBench scenario groups. It highlights the different capability profiles of the evaluated models in OpenClaw-style agent tasks, complementing the detailed score table that follows.
<div align="center">
<img src="./assets/zclawbench_model_radar_chart.jpeg" alt="ZClawBench model radar chart" width="70%">
<p><em></em></p>
</div>
| Scenario | Task Count | Claude-4.6-opus | GLM5-turbo | Gemini-3.1-Pro | GLM5 | MiniMax-M2.5 | Kimi-K2.5 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Overall | 116 | **0.654** | 0.564 | 0.503 | 0.482 | 0.408 | 0.402 |
| Information Search & Gathering | 22 | **0.687** | 0.449 | 0.345 | 0.401 | 0.225 | 0.198 |
| Office & Daily Tasks | 35 | **0.534** | 0.523 | 0.435 | 0.402 | 0.355 | 0.339 |
| Data Analysis | 10 | **0.725** | 0.578 | 0.362 | 0.412 | 0.368 | 0.388 |
| Development & Operations | 19 | **0.612** | 0.507 | 0.487 | 0.602 | 0.395 | 0.525 |
| Automation | 20 | **0.750** | 0.700 | 0.700 | 0.500 | 0.550 | 0.450 |
| Security | 10 | 0.820 | 0.780 | **0.860** | 0.740 | 0.780 | 0.760 |
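If the Overall row is the plain average over all 116 test cases, as the scoring section suggests, it should equal the task-count-weighted mean of the per-scenario scores, up to rounding of the three-decimal table entries. The sketch below checks this for four of the columns; the helper name `weighted_mean` is introduced here for illustration only.

```python
# Task counts per scenario, in table order.
task_counts = [22, 35, 10, 19, 20, 10]

# (per-scenario scores, reported Overall score), copied from the table.
reported = {
    "Claude-4.6-opus": ([0.687, 0.534, 0.725, 0.612, 0.750, 0.820], 0.654),
    "GLM5-turbo": ([0.449, 0.523, 0.578, 0.507, 0.700, 0.780], 0.564),
    "Gemini-3.1-Pro": ([0.345, 0.435, 0.362, 0.487, 0.700, 0.860], 0.503),
    "Kimi-K2.5": ([0.198, 0.339, 0.388, 0.525, 0.450, 0.760], 0.402),
}


def weighted_mean(scores: list[float], counts: list[int]) -> float:
    """Mean of scenario scores weighted by the number of tasks in each."""
    return sum(s * n for s, n in zip(scores, counts)) / sum(counts)


gaps = {}
for model, (scores, overall) in reported.items():
    recomputed = weighted_mean(scores, task_counts)
    gaps[model] = abs(recomputed - overall)
    print(f"{model}: recomputed {recomputed:.4f}, reported {overall:.3f}")
```

The recomputed values agree with the reported Overall scores to within about 0.001, which is consistent with rounding of the per-scenario entries.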
## Dataset Usage
This dataset can be loaded using the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("zai-org/ZClawBench")

# Access the training split
train_data = dataset["train"]

# View dataset structure
print(train_data)
print(train_data.features)

# Access a specific sample
sample = train_data[0]
trajectory = sample["trajectory"]
model_name = sample["model_name"]
task_id = sample["task_id"]
task_category = sample["task_category"]
```
## Dataset Structure
The dataset contains the following fields:
- `task_id`: Unique identifier for the ZClawBench test case
- `trajectory`: Full agent trajectory for a model on the task
- `model_name`: Name of the evaluated model
- `task_category`: Category of the task, such as Office & Daily Tasks or Development & Operations
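Given these fields, one simple way to get an overview is to count trajectories per model and task category. The sketch below works on plain dictionaries so that it runs without downloading anything; the sample records, including their identifiers and the placeholder trajectory strings, are invented and do not reflect actual dataset values.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_model_and_category(records: Iterable[Mapping]) -> Counter:
    """Count records for each (model_name, task_category) pair."""
    return Counter((r["model_name"], r["task_category"]) for r in records)


# Hypothetical records that only mimic the documented field names.
records = [
    {"task_id": "task-001", "model_name": "model-a",
     "task_category": "Data Analysis", "trajectory": "..."},
    {"task_id": "task-002", "model_name": "model-a",
     "task_category": "Data Analysis", "trajectory": "..."},
    {"task_id": "task-001", "model_name": "model-b",
     "task_category": "Data Analysis", "trajectory": "..."},
    {"task_id": "task-003", "model_name": "model-b",
     "task_category": "Security", "trajectory": "..."},
]

counts = count_by_model_and_category(records)
for (model, category), n in sorted(counts.items()):
    print(f"{model} | {category} | {n}")
```

With the real dataset, the same function could be called on `dataset["train"]`, since iterating over a `datasets.Dataset` yields one dictionary per row.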
Provider:
maas
Created:
2026-03-19



