Mosi-AI/LiveClawbench-trajectories
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Mosi-AI/LiveClawbench-trajectories
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- benchmark
- openclaw
- trajectory
- leaderboard
size_categories:
- n<1K
pretty_name: "LiveClawBench-traj"
version: "0.1.0"
---
<h1 align="center">LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks</h1>
<p align="center">
📖 <a href="https://github.com/Mosi-AI/LiveClawBench/releases/download/v0.1-preprint/LiveClawBench.pdf">Paper</a> | 🛠️ <a href="https://github.com/Mosi-AI/LiveClawBench">GitHub</a> | 🧩 <a href="https://github.com/Mosi-AI/LiveClawBench/tree/main/tasks">Tasks</a>
</p>
## Overview
LLM agents are increasingly expected to handle real-world assistant tasks — booking flights, managing emails, debugging code, curating knowledge bases — yet existing benchmarks evaluate them under isolated difficulty sources. **LiveClawBench** addresses this gap by introducing a **Triple-Axis Complexity Framework** and building a pilot benchmark of 30 manually constructed tasks with explicit factor annotations, controlled pairs, deterministic mock environments, and outcome-driven evaluation.
The core questions are:
- **How do current LLM-based agents perform when confronting real-world tasks?**
- **How does agent capability degrade when tasks stack multiple complexity factors?**
This dataset release includes:
- 🏆 **Leaderboard** scores for 7 open-source models (Avg@3) on LiveClawBench
- 📊 **630 agent trajectories** in ATIF-v1.2 format (7 models × 30 tasks × 3 runs)
All tasks run inside isolated Docker containers orchestrated by [Harbor](https://github.com/Mosi-AI/claw-harbor), with the [OpenClaw](https://github.com/openclaw/openclaw) agent platform executing inside each container.
---
## Benchmark Overview
### Triple-Axis Complexity Framework
LiveClawBench defines four orthogonal complexity factors that characterize structural sources of difficulty beyond baseline task execution:
| Factor | Axis | Description | # Tasks |
| ------------------------------------- | ----------- | ---------------------------------------------------------------------------------------- | -------:|
| **A1** — Cross-Service Dependency | Environment | Coordinate multiple independent services (email, airline, calendar) in a single workflow | 10 |
| **A2** — Contaminated Initial State | Environment | Environment starts broken or corrupt; agent must diagnose and repair before acting | 6 |
| **B1** — Implicit Goal Resolution | Cognitive | Goal is not stated explicitly; agent must infer constraints or seek clarification | 4 |
| **B2** — Knowledge System Maintenance | Cognitive | Create, update, resolve conflicts in, or manage a persistent skill/knowledge repository | 11 |
Factor combination distribution across the 30 tasks:
| Combination | Count | Percentage |
| --------------------- | -----:| ----------:|
| No factors (baseline) | 8 | 26.7% |
| Single factor | 14 | 46.7% |
| Dual factor | 7 | 23.3% |
| Triple factor | 1 | 3.3% |
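
The two tables above can be cross-checked against each other: counting one assignment per factor per task, the per-factor task counts and the combination counts should give the same total. A small sketch of that arithmetic, using only numbers from the tables (the variable names are illustrative):

```python
# Tasks carrying each factor, from the factor table.
factor_task_counts = {"A1": 10, "A2": 6, "B1": 4, "B2": 11}

# Number of tasks with 0, 1, 2, or 3 active factors, from the combination table.
combination_counts = {0: 8, 1: 14, 2: 7, 3: 1}

total_tasks = sum(combination_counts.values())

# A task with k factors contributes k factor assignments.
assignments_from_combos = sum(k * n for k, n in combination_counts.items())
assignments_from_factors = sum(factor_task_counts.values())

# Share of the benchmark in each combination bucket, as a percentage.
percentages = {
    k: round(100 * n / total_tasks, 1) for k, n in combination_counts.items()
}

print(total_tasks, assignments_from_combos, assignments_from_factors)
print(percentages)
```

Both totals come to 31 assignments over 30 tasks, so the per-factor task counts sum to more than 30 because dual- and triple-factor tasks are counted once per factor.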
### Task Distribution
The benchmark covers 7 primary domains across 3 difficulty levels:
| Domain | Easy | Medium | Hard | Total |
| ----------------------- | ------:| ------:| -----:| ------:|
| E-commerce & Daily Svcs | 7 | 1 | 3 | 11 |
| Documents & Knowledge | 6 | 3 | — | 9 |
| Communication & Email | 2 | — | — | 2 |
| Calendar & Task Mgmt | 1 | 1 | — | 2 |
| Coding & Software Dev | 2 | — | — | 2 |
| DevOps & Env Repair | — | — | 2 | 2 |
| Deep Research & Report | — | 2 | — | 2 |
| **Total** | **18** | **7** | **5** | **30** |
### Difficulty Calibration
Difficulty labels (E/M/H) are **empirically calibrated**, not designer-assigned. We ran 3 calibration models (MiniMax-M2.5, Kimi-K2.5, GLM-5) with 3 trials per task, computed per-task average solve rates, and applied the following thresholds:
| Label | Solve Rate Range | Count |
| ---------- | ---------------- | ----- |
| Easy (E) | > 0.7 | 18 |
| Medium (M) | (0.3, 0.7] | 7 |
| Hard (H) | [0, 0.3] | 5 |
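
The thresholds above translate directly into a labeling function. The sketch below is illustrative only: it assumes the per-task solve rate is the plain mean over every trial of every calibration model (the card does not spell out the pooling), and the trial scores in the example are hypothetical.

```python
def difficulty_label(solve_rate: float) -> str:
    """Map a per-task average solve rate in [0, 1] to E, M, or H."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError(f"solve rate out of range: {solve_rate}")
    if solve_rate > 0.7:
        return "E"  # Easy: > 0.7
    if solve_rate > 0.3:
        return "M"  # Medium: (0.3, 0.7]
    return "H"      # Hard: [0, 0.3]


def calibrate(trials_by_model: dict[str, list[float]]) -> str:
    """Pool all trials of all calibration models, then apply the thresholds.

    Pooling by a flat mean is an assumption made for this sketch.
    """
    scores = [s for trials in trials_by_model.values() for s in trials]
    return difficulty_label(sum(scores) / len(scores))


# Hypothetical trial scores for a single task (3 models x 3 trials).
example = {
    "MiniMax-M2.5": [1.0, 1.0, 0.5],
    "Kimi-K2.5": [1.0, 0.0, 1.0],
    "GLM-5": [0.5, 1.0, 1.0],
}
print(calibrate(example))
```

Note that the boundaries are closed on the harder side: a solve rate of exactly 0.7 is Medium and exactly 0.3 is Hard.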
---
## Leaderboard
### Overall Performance
All scores are **Avg@3** (mean of 3 independent runs per task, then averaged across 30 tasks). The scores below are rescaled from [0,1] to [0,100] for readability.
| # | Model | Avg | Easy | Medium | Hard |
| ---:| --------------------- | --------:| ----:| ------:| ----:|
| 1 | **Qwen3.5-397B-A17B** | **72.6** | 92.7 | 58.4 | 20.0 |
| 2 | MiniMax-M2.7 | 71.2 | 92.6 | 54.5 | 17.8 |
| 3 | GLM-5 | 69.9 | 92.6 | 58.1 | 4.8 |
| 4 | GLM-5-Turbo | 66.5 | 80.5 | 52.9 | 35.2 |
| 5 | Qwen3.5-122B-A10B | 64.4 | 83.6 | 60.4 | 1.1 |
| 6 | Qwen3.5-27B | 64.2 | 83.4 | 51.1 | 13.1 |
| 7 | Qwen3.5-35B-A3B | 58.3 | 75.9 | 47.6 | 9.6 |
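
Because every task carries equal weight, each model's overall average should equal the task-count-weighted mean of its Easy, Medium, and Hard columns (18, 7, and 5 tasks). The sketch below redoes that arithmetic from the published table; small gaps are expected because the table values are rounded to one decimal.

```python
# Number of tasks per difficulty level, from the Task Distribution table.
TASK_COUNTS = {"Easy": 18, "Medium": 7, "Hard": 5}

# (reported overall Avg, per-difficulty scores), copied from the leaderboard.
LEADERBOARD = {
    "Qwen3.5-397B-A17B": (72.6, {"Easy": 92.7, "Medium": 58.4, "Hard": 20.0}),
    "MiniMax-M2.7": (71.2, {"Easy": 92.6, "Medium": 54.5, "Hard": 17.8}),
    "GLM-5": (69.9, {"Easy": 92.6, "Medium": 58.1, "Hard": 4.8}),
    "GLM-5-Turbo": (66.5, {"Easy": 80.5, "Medium": 52.9, "Hard": 35.2}),
    "Qwen3.5-122B-A10B": (64.4, {"Easy": 83.6, "Medium": 60.4, "Hard": 1.1}),
    "Qwen3.5-27B": (64.2, {"Easy": 83.4, "Medium": 51.1, "Hard": 13.1}),
    "Qwen3.5-35B-A3B": (58.3, {"Easy": 75.9, "Medium": 47.6, "Hard": 9.6}),
}


def weighted_overall(by_difficulty: dict[str, float]) -> float:
    """Task-count-weighted mean of the per-difficulty scores."""
    total = sum(TASK_COUNTS.values())
    return sum(TASK_COUNTS[d] * by_difficulty[d] for d in TASK_COUNTS) / total


gaps = {}
for model, (reported, by_difficulty) in LEADERBOARD.items():
    recomputed = weighted_overall(by_difficulty)
    gaps[model] = abs(recomputed - reported)
    print(f"{model:<20} reported={reported:5.1f} recomputed={recomputed:6.2f}")

max_gap = max(gaps.values())
```

With the table values as given, every recomputed average lands within 0.1 of the reported one.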
<details>
<summary>📊 Visual comparison</summary>
```
Overall Avg@3 Score
Qwen3.5-397B-A17B ████████████████████████████████████▎ 72.6
MiniMax-M2.7 ███████████████████████████████████▌ 71.2
GLM-5 ██████████████████████████████████▉ 69.9
GLM-5-Turbo █████████████████████████████████▎ 66.5
Qwen3.5-122B-A10B ████████████████████████████████▎ 64.4
Qwen3.5-27B ████████████████████████████████▏ 64.2
Qwen3.5-35B-A3B █████████████████████████████▏ 58.3
0 20 40 60 80 100
```
</details>
### Performance by Complexity Factor
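The tables below show **how much each complexity factor degrades agent performance**. "w/o" = average score on tasks without the factor; "w/" = average score on tasks with the factor; "Δ" = w/ minus w/o, so a negative value is a drop. Some deltas differ by 0.1 from the difference of the displayed values, presumably because they were computed before rounding.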
**A1 — Cross-Service Dependency** (20 tasks w/o → 10 tasks w/)
| Model | w/o Factor | w/ Factor | Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B | 76.2 | 65.3 | −10.9 |
| MiniMax-M2.7 | 74.9 | 63.7 | −11.2 |
| GLM-5 | 73.6 | 62.4 | −11.2 |
| GLM-5-Turbo | 72.7 | 54.1 | −18.6 |
| Qwen3.5-122B-A10B | 73.9 | 45.6 | −28.3 |
| Qwen3.5-27B | 72.5 | 47.6 | −24.9 |
| Qwen3.5-35B-A3B | 67.8 | 39.3 | −28.5 |
**A2 — Contaminated Initial State** (24 tasks w/o → 6 tasks w/)
| Model | w/o Factor | w/ Factor | Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B | 75.7 | 60.2 | −15.5 |
| MiniMax-M2.7 | 74.2 | 59.0 | −15.2 |
| GLM-5 | 72.5 | 59.5 | −13.0 |
| GLM-5-Turbo | 66.0 | 68.6 | +2.7 |
| Qwen3.5-122B-A10B | 65.3 | 61.0 | −4.3 |
| Qwen3.5-27B | 64.0 | 65.1 | +1.1 |
| Qwen3.5-35B-A3B | 57.0 | 63.2 | +6.1 |
**B1 — Implicit Goal Resolution** (26 tasks w/o → 4 tasks w/)
| Model | w/o Factor | w/ Factor | Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B | 77.3 | 41.7 | −35.7 |
| MiniMax-M2.7 | 76.0 | 40.0 | −36.0 |
| GLM-5 | 73.7 | 45.0 | −28.7 |
| GLM-5-Turbo | 72.4 | 28.3 | −44.0 |
| Qwen3.5-122B-A10B | 71.3 | 20.0 | −51.3 |
| Qwen3.5-27B | 68.4 | 36.7 | −31.7 |
| Qwen3.5-35B-A3B | 64.2 | 20.0 | −44.2 |
**B2 — Knowledge System Maintenance** (19 tasks w/o → 11 tasks w/)
| Model | w/o Factor | w/ Factor | Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B | 72.2 | 73.3 | +1.1 |
| MiniMax-M2.7 | 70.7 | 72.0 | +1.3 |
| GLM-5 | 70.6 | 68.7 | −1.9 |
| GLM-5-Turbo | 66.7 | 66.1 | −0.6 |
| Qwen3.5-122B-A10B | 60.4 | 71.5 | +11.1 |
| Qwen3.5-27B | 61.6 | 68.6 | +7.0 |
| Qwen3.5-35B-A3B | 51.5 | 70.0 | +18.5 |
> **Key findings:**
>
> - **B1 (Implicit Goal Resolution)** causes the most severe degradation across all models (−28.7 to −51.3), confirming that autonomous constraint inference and fallback reasoning remain the hardest challenge.
> - **A1 (Cross-Service Dependency)** consistently degrades all models (−10.9 to −28.5), with smaller models suffering disproportionately.
> - **A2 (Contaminated Initial State)** shows mixed results: the three top-ranked models drop by 13.0 to 15.5 points and Qwen3.5-122B-A10B by 4.3, while GLM-5-Turbo, Qwen3.5-27B, and Qwen3.5-35B-A3B improve slightly (+1.1 to +6.1), possibly because A2 tasks overlap with coding/debug tasks where these models have relative strengths.
> - **B2 (Knowledge System Maintenance)** shows minimal or even positive impact (−1.9 to +18.5), suggesting that current models handle knowledge management tasks well.
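
The w/o, w/, and Δ columns can be reproduced from per-task scores by splitting the tasks on whether a factor is active. The following is a minimal sketch: the per-task scores in `toy_scores` are hypothetical, and the `(factors, score)` pair layout is an assumption of this example, since per-task scores are not among the released data fields.

```python
from statistics import mean


def factor_delta(task_scores, factor):
    """Return (w/o mean, w/ mean, delta) for one complexity factor.

    task_scores: list of (complexity_factors, score) pairs, one per task.
    delta is w/ minus w/o, so a negative value means the factor hurts.
    """
    with_factor = [s for factors, s in task_scores if factor in factors]
    without_factor = [s for factors, s in task_scores if factor not in factors]
    if not with_factor or not without_factor:
        raise ValueError(f"factor {factor!r} does not split the task set")
    w, wo = mean(with_factor), mean(without_factor)
    return wo, w, w - wo


# Hypothetical per-task Avg@3 scores on the 0-100 scale.
toy_scores = [
    ([], 100.0),            # baseline task
    (["A1"], 60.0),         # single factor
    (["A1", "B1"], 20.0),   # dual factor
    (["B2"], 80.0),
]

wo, w, delta = factor_delta(toy_scores, "A1")
print(f"A1: w/o={wo:.1f}  w/={w:.1f}  delta={delta:+.1f}")
```

Because dual- and triple-factor tasks appear in the "w/" group of every factor they carry, the per-factor deltas overlap and should not be read as independent, additive effects.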
### Performance by Domain
| Domain | Qwen3.5 397B | MiniMax M2.7 | GLM-5 (r) | GLM-5-Turbo | Qwen3.5 122B | Qwen3.5 27B | Qwen3.5 35B |
| ----------------------- | ------------:| ------------:| ---------:| -----------:| ------------:| -----------:| -----------:|
| Coding & Software Dev | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Communication & Email | 100.0 | 83.3 | 100.0 | 83.3 | 83.3 | 100.0 | 33.3 |
| E-commerce & Daily Svcs | 80.0 | 73.9 | 65.2 | 69.7 | 65.2 | 51.2 | 55.8 |
| Deep Research & Report | 73.8 | 64.0 | 69.6 | 79.0 | 63.7 | 64.8 | 72.9 |
| Documents & Knowledge | 73.1 | 73.8 | 68.5 | 63.3 | 73.2 | 69.4 | 69.3 |
| Calendar & Task Mgmt | 45.8 | 70.8 | 100.0 | 29.2 | 29.2 | 70.8 | 25.0 |
| DevOps & Env Repair | 0.0 | 11.1 | 12.0 | 38.0 | 2.8 | 32.9 | 24.1 |
> **Key findings:**
>
> - **Coding & Software Dev** is saturated: every model scores 100.0, because the current benchmark only includes 2 routine coding tasks (blog-site construction), which reflect everyday development workflows rather than algorithmic or systems-level programming. More diverse and challenging coding cases will be added in future releases.
> - **DevOps & Env Repair** (vue-build-fix tasks) is the weakest domain overall: four of the seven models score below 15, and the best (GLM-5-Turbo) reaches only 38.0.
> - **Calendar & Task Mgmt** shows extreme variance: GLM-5 (reasoning) achieves 100.0 while Qwen3.5-35B-A3B scores only 25.0. Note that this domain contains only 2 tasks.
---
## Evaluation Setup
### Evaluation Principles
LiveClawBench employs three evaluation approaches depending on task type:
1. **Script-based verification** — Deterministic checks: file existence, content matching, assertion pass rates. Used for tasks with objectively verifiable outcomes (e.g., successful purchases, correct code builds).
2. **Rubric-based evaluation** — Structured scoring rubric with weighted sub-dimensions. Used for tasks where output quality spans multiple measurable aspects (e.g., skill knowledge base updates with correctness, completeness, and formatting dimensions).
3. **LLM-as-judge** — An independent judge model scores open-ended outputs against reference criteria. Used for 5 tasks where output quality is nuanced and resists deterministic checking (e.g., research report synthesis, noise filtering).
All tasks produce a scalar score in [0.0, 1.0] with partial credit.
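
To make the rubric-based approach concrete, here is a minimal sketch of how weighted sub-dimensions could collapse into a single score in [0.0, 1.0]. The dimension names come from the example above; the weights, the sub-scores, and the `rubric_score` helper are hypothetical and are not taken from the benchmark's actual verifiers.

```python
def rubric_score(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, clamped to [0.0, 1.0]."""
    if sub_scores.keys() != weights.keys():
        raise ValueError("sub-scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = sum(weights[k] * sub_scores[k] for k in weights) / total_weight
    return min(1.0, max(0.0, score))


# Hypothetical rubric for a knowledge-base update task.
weights = {"correctness": 5, "completeness": 3, "formatting": 2}
sub_scores = {"correctness": 1.0, "completeness": 0.5, "formatting": 0.0}

print(rubric_score(sub_scores, weights))
```

Normalizing by the weight total keeps the result in range regardless of how the weights are scaled, which is one simple way to obtain partial credit.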
### Sampling Protocol
Each model runs each of the 30 tasks **3 times independently** to get the Avg@3 score. The overall benchmark score is the mean across all 30 task-level scores.
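
The aggregation described here is a mean of per-task means. A small sketch, using task names that appear elsewhere in this card but hypothetical run scores:

```python
from collections import defaultdict
from statistics import mean


def avg_at_k(run_scores):
    """Compute per-task means and the overall benchmark score.

    run_scores: iterable of (task_name, score) pairs, one per run,
    with scores in [0.0, 1.0].
    Returns (overall, per_task_means).
    """
    per_task = defaultdict(list)
    for task, score in run_scores:
        per_task[task].append(score)
    task_means = {task: mean(scores) for task, scores in per_task.items()}
    overall = mean(list(task_means.values()))
    return overall, task_means


# Hypothetical scores: 2 tasks x 3 runs.
runs = [
    ("flight-seat-selection", 1.0),
    ("flight-seat-selection", 0.5),
    ("flight-seat-selection", 0.0),
    ("watch-shop", 1.0),
    ("watch-shop", 1.0),
    ("watch-shop", 1.0),
]

overall, task_means = avg_at_k(runs)
print(task_means)
print(f"Overall: {100 * overall:.1f}")  # rescaled to [0, 100] as on the leaderboard
```

When every task has the same number of runs, this equals the flat mean over all runs; averaging per task first keeps the result well defined if a run is missing.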
### Model Configuration
All models are evaluated via the `moonshot/` provider format with reasoning mode enabled:
| Setting | Value | Description |
| ------------------ | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `CUSTOM_REASONING` | `true` | Enables thinking mode. Harbor auto-injects `--thinking medium` into OpenClaw, which sends `thinking: { type: "enabled" }` to the model API |
| `JUDGE_MODEL_ID` | `deepseek-v3.2` | Independent judge model used by the 5 LLM-judge-evaluated tasks |
> **Note on the `moonshot/` provider:** [Moonshot AI](https://www.moonshot.cn/) provides an OpenAI-compatible API gateway that hosts a variety of third-party open-source models (similar to how Together AI or OpenRouter aggregate models). We use Moonshot purely as an **inference endpoint** — it is not affiliated with this benchmark project, nor does it influence task design or scoring. Any OpenAI-compatible provider serving the same models should yield comparable results.
---
## Trajectory Data
### Data Release
We open-source **agent trajectories for all models on the leaderboard**. This dataset contains **630 trajectory records** (7 models × 30 tasks × 3 runs) in [ATIF-v1.2](https://github.com/openclaw/openclaw) (Agent Trajectory Interchange Format). All timestamps have been stripped for privacy.
> **Current split:** `v0.1.0`, the initial public release with 7 models and 30 tasks. More trajectories will be added as new cases are introduced to LiveClawBench.
### Data Fields
**Top-level record:**
| Field | Type | Description |
| ------------------- | ----------- | --------------------------------------------------------------------------------------------------- |
| `sample_id` | `string` | Unique identifier: `{model_name}_{case_name}_{run_id}` |
| `trajectory` | `object` | Full ATIF-v1.2 trajectory (see below) |
| `model_name` | `string` | Model identifier (e.g. `"qwen3.5-397b-a17b"`) |
| `case_id` | `int` | Numeric task ID (1–30) |
| `ability_category` | `string` | High-level ability category (e.g. `"proactive decision making"`, `"cross environment composition"`) |
| `case_name` | `string` | Task name (e.g. `"flight-seat-selection"`) |
| `difficulty` | `string` | `"E"` (Easy), `"M"` (Medium), or `"H"` (Hard) |
| `domain` | `string` | Primary domain (e.g. `"E-commerce & Daily Svcs"`) |
| `domains_multi` | `string` | All applicable domains, semicolon-separated |
| `complexity_factor` | `list[str]` | Active complexity factors, e.g. `["A1", "B2"]`; empty for baseline tasks |
**Trajectory object (`trajectory`):**
| Field | Type | Description |
| ---------------- | -------------- | ---------------------------------------------------------------------------------- |
| `schema_version` | `string` | Always `"ATIF-v1.2"` |
| `session_id` | `string` | Session identifier (typically `"harbor"`) |
| `agent` | `object` | Agent metadata: `{name, version, model_name}` |
| `steps` | `list[object]` | Ordered list of interaction steps |
| `final_metrics` | `object` | `{total_prompt_tokens, total_completion_tokens, total_cached_tokens, total_steps}` |
**Step schema:**
Each step is either a **user step** or an **agent step**:
| Field | User Step | Agent Step | Description |
| ------------------- |:---------:|:----------:| ------------------------------------------------------------------------- |
| `step_id` | ✓ | ✓ | Sequential step number |
| `source` | `"user"` | `"agent"` | Who produced this step |
| `message` | ✓ | ✓ | Visible message text |
| `model_name` | — | ✓ | Model that generated this step |
| `reasoning_content` | — | ✓ | Internal chain-of-thought / reasoning trace |
| `tool_calls` | — | ✓ | List of `{tool_call_id, function_name, arguments}` |
| `observation` | — | ✓ | Tool results: `{results: [{source_call_id, content}]}` |
| `metrics` | — | ✓ | Per-step token usage: `{prompt_tokens, completion_tokens, cached_tokens}` |
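
As an illustration of the step schema, the sketch below walks a trajectory and counts tool calls per function name. The toy trajectory is hand-written to follow the field tables above, and its tool names (`exec`, `read`) and arguments are hypothetical placeholders, not values taken from the released data.

```python
from collections import Counter


def tool_usage(trajectory: dict) -> Counter:
    """Count tool calls per function name across all agent steps."""
    counts = Counter()
    for step in trajectory.get("steps", []):
        if step.get("source") != "agent":
            continue  # user steps carry no tool calls
        for call in step.get("tool_calls") or []:
            counts[call["function_name"]] += 1
    return counts


# Hand-written toy trajectory that mirrors the documented schema.
toy_trajectory = {
    "schema_version": "ATIF-v1.2",
    "session_id": "harbor",
    "steps": [
        {"step_id": 1, "source": "user", "message": "Fix the build."},
        {
            "step_id": 2,
            "source": "agent",
            "message": "",
            "tool_calls": [
                {"tool_call_id": "c1", "function_name": "exec",
                 "arguments": {"command": "npm run build"}},
            ],
            "observation": {"results": [{"source_call_id": "c1", "content": "error"}]},
        },
        {
            "step_id": 3,
            "source": "agent",
            "message": "",
            "tool_calls": [
                {"tool_call_id": "c2", "function_name": "read",
                 "arguments": {"path": "package.json"}},
                {"tool_call_id": "c3", "function_name": "exec",
                 "arguments": {"command": "npm install"}},
            ],
        },
        {"step_id": 4, "source": "agent", "message": "Build fixed."},
    ],
}

print(tool_usage(toy_trajectory))
```

The `or []` guard covers agent steps where `tool_calls` is absent or null, which seems a safe assumption for final answer steps.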
---
## Dataset Usage
```python
import json

from datasets import load_dataset

# Repo id as shown on this dataset page; the split is named after the release.
ds = load_dataset("Mosi-AI/LiveClawbench-trajectories", split="v0.1.0")

# Explore
print(f"Total samples: {len(ds)}")  # 630
print(f"Features: {ds.features}")

# Access a sample
sample = ds[0]
print(sample["sample_id"])          # e.g. "glm-5-turbo_watch-shop_1"
print(sample["model_name"])         # e.g. "glm-5-turbo"
print(sample["case_name"])          # e.g. "watch-shop"
print(sample["difficulty"])         # e.g. "E"
print(sample["complexity_factor"])  # e.g. []

# The trajectory column is stored as a JSON string; parse it to get the ATIF dict
traj = json.loads(sample["trajectory"])
print(f"Steps: {len(traj['steps'])}")
print(f"Schema: {traj['schema_version']}")  # ATIF-v1.2
```
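
Building on the snippet above, one possible next step is to aggregate trajectory metrics per model, optionally restricted to tasks with a given complexity factor. The helper below is a sketch: `mean_steps_by_model` is a hypothetical name, it relies only on the fields documented in this card, and it is demonstrated on hand-written toy records rather than real data.

```python
import json
from collections import defaultdict


def mean_steps_by_model(records, factor=None):
    """Average `final_metrics.total_steps` per model.

    records: iterable of dataset rows (dicts), e.g. the loaded dataset itself.
    factor: if given (e.g. "A1"), keep only tasks carrying that factor.
    """
    steps = defaultdict(list)
    for rec in records:
        if factor is not None and factor not in rec["complexity_factor"]:
            continue
        traj = rec["trajectory"]
        if isinstance(traj, str):  # stored as a JSON string in the dataset
            traj = json.loads(traj)
        steps[rec["model_name"]].append(traj["final_metrics"]["total_steps"])
    return {model: sum(v) / len(v) for model, v in steps.items()}


def _toy_record(model, factors, total_steps):
    """Build a minimal hypothetical record with only the fields used here."""
    trajectory = {"final_metrics": {"total_steps": total_steps}}
    return {
        "model_name": model,
        "complexity_factor": factors,
        "trajectory": json.dumps(trajectory),
    }


toy_records = [
    _toy_record("glm-5", ["A1"], 10),
    _toy_record("glm-5", [], 4),
    _toy_record("qwen3.5-27b", ["A1", "B1"], 20),
]

print(mean_steps_by_model(toy_records))
print(mean_steps_by_model(toy_records, factor="A1"))
```

With the real data, passing the loaded `ds` object as `records` should work, since iterating over a `datasets.Dataset` yields one dict per row.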
---
## Citation
```bibtex
@article{liveclawbench2026,
title={LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks},
author={Xiang Long and Li Du and Yilong Xu and Fangcheng Liu and Haoqing Wang and Ning Ding and Ziheng Li and Jianyuan Guo and Yehui Tang},
journal={arXiv preprint},
year={2026}
}
```
## License
This dataset is released under the [MIT License](https://github.com/Mosi-AI/LiveClawBench/blob/main/LICENSE).
Provider: Mosi-AI



