Name: Mosi-AI/LiveClawbench-trajectories
Creator: Mosi-AI
Published: 2026-04-08 11:10:10
License: (no description provided)

Download link:

https://hf-mirror.com/datasets/Mosi-AI/LiveClawbench-trajectories


Resource description:

---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- agent
- benchmark
- openclaw
- trajectory
- leaderboard
size_categories:
- n<1K
pretty_name: "LiveClawBench-traj"
version: "0.1.0"
---

<h1 align="center">LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks</h1>

<p align="center">
📖 <a href="https://github.com/Mosi-AI/LiveClawBench/releases/download/v0.1-preprint/LiveClawBench.pdf">Paper</a> |
🛠️ <a href="https://github.com/Mosi-AI/LiveClawBench">GitHub</a> |
🧩 <a href="https://github.com/Mosi-AI/LiveClawBench/tree/main/tasks">Tasks</a>
</p>

## Overview

LLM agents are increasingly expected to handle real-world assistant tasks (booking flights, managing emails, debugging code, curating knowledge bases), yet existing benchmarks evaluate them under isolated difficulty sources. **LiveClawBench** addresses this gap by introducing a **Triple-Axis Complexity Framework** and building a pilot benchmark of 30 manually constructed tasks with explicit factor annotations, controlled pairs, deterministic mock environments, and outcome-driven evaluation.

The core questions are:

- **How do current LLM-based agents perform when confronting real-world tasks?**
- **How does agent capability degrade when tasks stack multiple complexity factors?**

This dataset release includes:

- 🏆 **Leaderboard** scores for 7 open-source models (Avg@3) on LiveClawBench
- 📊 **630 agent trajectories** in ATIF-v1.2 format (7 models × 30 tasks × 3 runs)

All tasks run inside isolated Docker containers orchestrated by [Harbor](https://github.com/Mosi-AI/claw-harbor), with the [OpenClaw](https://github.com/openclaw/openclaw) agent platform executing inside each container.
---

## Benchmark Overview

### Triple-Axis Complexity Framework

LiveClawBench defines four orthogonal complexity factors that characterize structural sources of difficulty beyond baseline task execution:

| Factor                                | Axis        | Description                                                                              | # Tasks |
| ------------------------------------- | ----------- | ---------------------------------------------------------------------------------------- | -------:|
| **A1** — Cross-Service Dependency     | Environment | Coordinate multiple independent services (email, airline, calendar) in a single workflow |      10 |
| **A2** — Contaminated Initial State   | Environment | Environment starts broken or corrupt; agent must diagnose and repair before acting       |       6 |
| **B1** — Implicit Goal Resolution     | Cognitive   | Goal is not stated explicitly; agent must infer constraints or seek clarification        |       4 |
| **B2** — Knowledge System Maintenance | Cognitive   | Create, update, resolve conflicts in, or manage a persistent skill/knowledge repository  |      11 |

Factor combination distribution across the 30 tasks:

| Combination           | Count | Percentage |
| --------------------- | -----:| ----------:|
| No factors (baseline) |     8 |      26.7% |
| Single factor         |    14 |      46.7% |
| Dual factor           |     7 |      23.3% |
| Triple factor         |     1 |       3.3% |

### Task Distribution

The benchmark covers 7 primary domains across 3 difficulty levels:

| Domain                  |   Easy | Medium |  Hard |  Total |
| ----------------------- | ------:| ------:| -----:| ------:|
| E-commerce & Daily Svcs |      7 |      1 |     3 |     11 |
| Documents & Knowledge   |      6 |      3 |     — |      9 |
| Communication & Email   |      2 |      — |     — |      2 |
| Calendar & Task Mgmt    |      1 |      1 |     — |      2 |
| Coding & Software Dev   |      2 |      — |     — |      2 |
| DevOps & Env Repair     |      — |      — |     2 |      2 |
| Deep Research & Report  |      — |      2 |     — |      2 |
| **Total**               | **18** |  **7** | **5** | **30** |

### Difficulty Calibration

Difficulty labels (E/M/H) are **empirically calibrated**, not designer-assigned.
We ran 3 calibration models (MiniMax-M2.5, Kimi-K2.5, GLM-5) with 3 trials per task, computed per-task average solve rates, and applied the following thresholds:

| Label      | Solve Rate Range | Count |
| ---------- | ---------------- | ----- |
| Easy (E)   | > 0.7            | 18    |
| Medium (M) | (0.3, 0.7]       | 7     |
| Hard (H)   | [0, 0.3]         | 5     |

---

## Leaderboard

### Overall Performance

All scores are **Avg@3** (mean of 3 independent runs per task, then averaged across 30 tasks). The scores below are rescaled from [0,1] to [0,100] for readability.

|   # | Model                 |      Avg | Easy | Medium | Hard |
| ---:| --------------------- | --------:| ----:| ------:| ----:|
|   1 | **Qwen3.5-397B-A17B** | **72.6** | 92.7 |   58.4 | 20.0 |
|   2 | MiniMax-M2.7          |     71.2 | 92.6 |   54.5 | 17.8 |
|   3 | GLM-5                 |     69.9 | 92.6 |   58.1 |  4.8 |
|   4 | GLM-5-Turbo           |     66.5 | 80.5 |   52.9 | 35.2 |
|   5 | Qwen3.5-122B-A10B     |     64.4 | 83.6 |   60.4 |  1.1 |
|   6 | Qwen3.5-27B           |     64.2 | 83.4 |   51.1 | 13.1 |
|   7 | Qwen3.5-35B-A3B       |     58.3 | 75.9 |   47.6 |  9.6 |

<details>
<summary>📊 Visual comparison</summary>

```
Overall Avg@3 Score

Qwen3.5-397B-A17B  ████████████████████████████████████▎ 72.6
MiniMax-M2.7       ███████████████████████████████████▌ 71.2
GLM-5              ██████████████████████████████████▉ 69.9
GLM-5-Turbo        █████████████████████████████████▎ 66.5
Qwen3.5-122B-A10B  ████████████████████████████████▎ 64.4
Qwen3.5-27B        ████████████████████████████████▏ 64.2
Qwen3.5-35B-A3B    █████████████████████████████▏ 58.3
                   0         20        40        60        80        100
```

</details>

### Performance by Complexity Factor

The tables in this section show **how much each complexity factor changes agent performance**. "w/o" = average score on tasks without the factor; "w/" = average on tasks with the factor; "Δ" = the difference (w/ minus w/o), so a negative value is a drop.
**A1 — Cross-Service Dependency** (20 tasks w/o → 10 tasks w/)

| Model             | w/o Factor | w/ Factor |     Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B |       76.2 |      65.3 | −10.9 |
| MiniMax-M2.7      |       74.9 |      63.7 | −11.2 |
| GLM-5             |       73.6 |      62.4 | −11.2 |
| GLM-5-Turbo       |       72.7 |      54.1 | −18.6 |
| Qwen3.5-122B-A10B |       73.9 |      45.6 | −28.3 |
| Qwen3.5-27B       |       72.5 |      47.6 | −24.9 |
| Qwen3.5-35B-A3B   |       67.8 |      39.3 | −28.5 |

**A2 — Contaminated Initial State** (24 tasks w/o → 6 tasks w/)

| Model             | w/o Factor | w/ Factor |     Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B |       75.7 |      60.2 | −15.5 |
| MiniMax-M2.7      |       74.2 |      59.0 | −15.2 |
| GLM-5             |       72.5 |      59.5 | −13.0 |
| GLM-5-Turbo       |       66.0 |      68.6 |  +2.7 |
| Qwen3.5-122B-A10B |       65.3 |      61.0 |  −4.3 |
| Qwen3.5-27B       |       64.0 |      65.1 |  +1.1 |
| Qwen3.5-35B-A3B   |       57.0 |      63.2 |  +6.1 |

**B1 — Implicit Goal Resolution** (26 tasks w/o → 4 tasks w/)

| Model             | w/o Factor | w/ Factor |     Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B |       77.3 |      41.7 | −35.7 |
| MiniMax-M2.7      |       76.0 |      40.0 | −36.0 |
| GLM-5             |       73.7 |      45.0 | −28.7 |
| GLM-5-Turbo       |       72.4 |      28.3 | −44.0 |
| Qwen3.5-122B-A10B |       71.3 |      20.0 | −51.3 |
| Qwen3.5-27B       |       68.4 |      36.7 | −31.7 |
| Qwen3.5-35B-A3B   |       64.2 |      20.0 | −44.2 |

**B2 — Knowledge System Maintenance** (19 tasks w/o → 11 tasks w/)

| Model             | w/o Factor | w/ Factor |     Δ |
| ----------------- | ----------:| ---------:| -----:|
| Qwen3.5-397B-A17B |       72.2 |      73.3 |  +1.1 |
| MiniMax-M2.7      |       70.7 |      72.0 |  +1.3 |
| GLM-5             |       70.6 |      68.7 |  −1.9 |
| GLM-5-Turbo       |       66.7 |      66.1 |  −0.6 |
| Qwen3.5-122B-A10B |       60.4 |      71.5 | +11.1 |
| Qwen3.5-27B       |       61.6 |      68.6 |  +7.0 |
| Qwen3.5-35B-A3B   |       51.5 |      70.0 | +18.5 |

> **Key findings:**
>
> - **B1 (Implicit Goal Resolution)** causes the most severe degradation across all models (−28.7 to −51.3), confirming that autonomous constraint inference and fallback reasoning remain the hardest challenge.
> - **A1 (Cross-Service Dependency)** consistently degrades all models (−10.9 to −28.5), with smaller models suffering disproportionately.
> - **A2 (Contaminated Initial State)** shows mixed results: larger models degrade significantly while smaller models like GLM-5-Turbo and Qwen3.5-35B-A3B actually improve, possibly because A2 tasks overlap with coding/debug tasks where these models have relative strengths.
> - **B2 (Knowledge System Maintenance)** shows minimal or even positive impact, suggesting that current models handle knowledge management tasks well.

### Performance by Domain

| Domain                  | Qwen3.5 397B | MiniMax M2.7 | GLM-5 (r) | GLM-5-Turbo | Qwen3.5 122B | Qwen3.5 27B | Qwen3.5 35B |
| ----------------------- | ------------:| ------------:| ---------:| -----------:| ------------:| -----------:| -----------:|
| Coding & Software Dev   |        100.0 |        100.0 |     100.0 |       100.0 |        100.0 |       100.0 |       100.0 |
| Communication & Email   |        100.0 |         83.3 |     100.0 |        83.3 |         83.3 |       100.0 |        33.3 |
| E-commerce & Daily Svcs |         80.0 |         73.9 |      65.2 |        69.7 |         65.2 |        51.2 |        55.8 |
| Deep Research & Report  |         73.8 |         64.0 |      69.6 |        79.0 |         63.7 |        64.8 |        72.9 |
| Documents & Knowledge   |         73.1 |         73.8 |      68.5 |        63.3 |         73.2 |        69.4 |        69.3 |
| Calendar & Task Mgmt    |         45.8 |         70.8 |     100.0 |        29.2 |         29.2 |        70.8 |        25.0 |
| DevOps & Env Repair     |          0.0 |         11.1 |      12.0 |        38.0 |          2.8 |        32.9 |        24.1 |

> **Key findings:**
>
> - **Coding & Software Dev** shows perfect scores for every model because the current benchmark only includes 2 routine coding tasks (blog-site construction), which are more about everyday development workflows than algorithmic or systems-level programming. More diverse and challenging coding cases will be added in future releases.
> - **DevOps & Env Repair** (vue-build-fix tasks) is the weakest domain for six of the seven models (GLM-5-Turbo scores lower on Calendar & Task Mgmt), with four models scoring below 15%.
> - **Calendar & Task Mgmt** shows extreme variance: GLM-5 (reasoning) achieves 100% while Qwen3.5-35B scores only 25%.
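The aggregation behind the leaderboard and the per-factor tables can be sketched as follows. This is a minimal illustration, not the project's scoring code: the task names, run scores, and factor annotations in `RUNS` and `FACTORS` are made up, and the helper names `avg_at_3`, `difficulty_label`, and `factor_delta` are hypothetical. It assumes Δ is the mean over tasks with the factor minus the mean over tasks without it, which matches the signs in the factor tables, and it reuses the thresholds from the Difficulty Calibration table.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1] for one model: three independent runs each.
RUNS = {
    "task-a": [1.0, 1.0, 0.5],
    "task-b": [0.0, 0.5, 0.25],
    "task-c": [1.0, 0.75, 1.0],
    "task-d": [0.0, 0.0, 0.3],
}

# Hypothetical factor annotations; an empty list marks a baseline task.
FACTORS = {
    "task-a": [],
    "task-b": ["A1", "B1"],
    "task-c": ["B2"],
    "task-d": ["A1"],
}


def avg_at_3(runs):
    """Mean score over the runs of each task (Avg@3 when there are 3 runs)."""
    return {task: mean(scores) for task, scores in runs.items()}


def difficulty_label(solve_rate):
    """Map a solve rate to E/M/H using the thresholds from the calibration table."""
    if solve_rate > 0.7:
        return "E"
    if solve_rate > 0.3:
        return "M"
    return "H"


def factor_delta(task_scores, factors, factor):
    """Return (w/o mean, w/ mean, delta) on a 0-100 scale for one factor."""
    with_f = [s for t, s in task_scores.items() if factor in factors[t]]
    without_f = [s for t, s in task_scores.items() if factor not in factors[t]]
    w, wo = 100 * mean(with_f), 100 * mean(without_f)
    return wo, w, w - wo


task_scores = avg_at_3(RUNS)
overall = 100 * mean(task_scores.values())  # benchmark score: mean over tasks
wo, w, delta = factor_delta(task_scores, FACTORS, "A1")
print(f"Overall Avg@3: {overall:.1f}")
print(f"A1: w/o {wo:.1f}, w/ {w:.1f}, delta {delta:+.1f}")
```

Averaging runs within a task first and tasks second keeps every task equally weighted, which is how the Avg@3 definition above is worded. Note that in the benchmark itself the difficulty labels come from the calibration models' solve rates, not from the model being evaluated.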
---

## Evaluation Setup

### Evaluation Principles

LiveClawBench employs three evaluation approaches depending on task type:

1. **Script-based verification**: Deterministic checks (file existence, content matching, assertion pass rates). Used for tasks with objectively verifiable outcomes (e.g., successful purchases, correct code builds).
2. **Rubric-based evaluation**: Structured scoring rubric with weighted sub-dimensions. Used for tasks where output quality spans multiple measurable aspects (e.g., skill knowledge base updates with correctness, completeness, and formatting dimensions).
3. **LLM-as-judge**: An independent judge model scores open-ended outputs against reference criteria. Used for 5 tasks where output quality is nuanced and resists deterministic checking (e.g., research report synthesis, noise filtering).

All tasks produce a scalar score in [0.0, 1.0] with partial credit.

### Sampling Protocol

Each model runs each of the 30 tasks **3 times independently** to get the Avg@3 score. The overall benchmark score is the mean across all 30 task-level scores.

### Model Configuration

All models are evaluated via the `moonshot/` provider format with reasoning mode enabled:

| Setting            | Value           | Description                                                                                                                                |
| ------------------ | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `CUSTOM_REASONING` | `true`          | Enables thinking mode. Harbor auto-injects `--thinking medium` into OpenClaw, which sends `thinking: { type: "enabled" }` to the model API |
| `JUDGE_MODEL_ID`   | `deepseek-v3.2` | Independent judge model used by the 5 LLM-judge-evaluated tasks                                                                            |

> **Note on the `moonshot/` provider:** [Moonshot AI](https://www.moonshot.cn/) provides an OpenAI-compatible API gateway that hosts a variety of third-party open-source models (similar to how Together AI or OpenRouter aggregate models).
> We use Moonshot purely as an **inference endpoint**; it is not affiliated with this benchmark project, nor does it influence task design or scoring. Any OpenAI-compatible provider serving the same models should yield comparable results.

---

## Trajectory Data

### Data Release

We open-source **agent trajectories for all models on the leaderboard**. This dataset contains **630 trajectory records** (7 models × 30 tasks × 3 runs) in [ATIF-v1.2](https://github.com/openclaw/openclaw) (Agent Trajectory Interchange Format). All timestamps have been stripped for privacy.

> **Current split:** `v0.1.0`, the initial public release with 7 models and 30 tasks. More trajectories will be added as new cases are introduced to LiveClawBench.

### Data Fields

**Top-level record:**

| Field               | Type        | Description                                                                                         |
| ------------------- | ----------- | --------------------------------------------------------------------------------------------------- |
| `sample_id`         | `string`    | Unique identifier: `{model_name}_{case_name}_{run_id}`                                              |
| `trajectory`        | `object`    | Full ATIF-v1.2 trajectory (see below); serialized as a JSON string in the released data             |
| `model_name`        | `string`    | Model identifier (e.g. `"qwen3.5-397b-a17b"`)                                                       |
| `case_id`           | `int`       | Numeric task ID (1–30)                                                                              |
| `ability_category`  | `string`    | High-level ability category (e.g. `"proactive decision making"`, `"cross environment composition"`) |
| `case_name`         | `string`    | Task name (e.g. `"flight-seat-selection"`)                                                          |
| `difficulty`        | `string`    | `"E"` (Easy), `"M"` (Medium), or `"H"` (Hard)                                                       |
| `domain`            | `string`    | Primary domain (e.g. `"E-commerce & Daily Svcs"`)                                                   |
| `domains_multi`     | `string`    | All applicable domains, semicolon-separated                                                         |
| `complexity_factor` | `list[str]` | Active complexity factors, e.g. `["A1", "B2"]`; empty for baseline tasks                            |

**Trajectory object (`trajectory`):**

| Field            | Type           | Description                                                                        |
| ---------------- | -------------- | ---------------------------------------------------------------------------------- |
| `schema_version` | `string`       | Always `"ATIF-v1.2"`                                                               |
| `session_id`     | `string`       | Session identifier (typically `"harbor"`)                                          |
| `agent`          | `object`       | Agent metadata: `{name, version, model_name}`                                      |
| `steps`          | `list[object]` | Ordered list of interaction steps                                                  |
| `final_metrics`  | `object`       | `{total_prompt_tokens, total_completion_tokens, total_cached_tokens, total_steps}` |

**Step schema:**

Each step is either a **user step** or an **agent step**:

| Field               | User Step | Agent Step | Description                                                               |
| ------------------- |:---------:|:----------:| ------------------------------------------------------------------------- |
| `step_id`           | ✓         | ✓          | Sequential step number                                                    |
| `source`            | `"user"`  | `"agent"`  | Who produced this step                                                    |
| `message`           | ✓         | ✓          | Visible message text                                                      |
| `model_name`        | —         | ✓          | Model that generated this step                                            |
| `reasoning_content` | —         | ✓          | Internal chain-of-thought / reasoning trace                               |
| `tool_calls`        | —         | ✓          | List of `{tool_call_id, function_name, arguments}`                        |
| `observation`       | —         | ✓          | Tool results: `{results: [{source_call_id, content}]}`                    |
| `metrics`           | —         | ✓          | Per-step token usage: `{prompt_tokens, completion_tokens, cached_tokens}` |

---

## Dataset Usage

```python
import json

from datasets import load_dataset

# Repository ID as written in the original card. This listing names the dataset
# "Mosi-AI/LiveClawbench-trajectories"; try that ID if this one does not resolve.
ds = load_dataset("Mosi-AI/LiveClawBench", split="v0.1.0")

# Explore
print(f"Total samples: {len(ds)}")  # 630
print(f"Features: {ds.features}")

# Access a sample
sample = ds[0]
print(sample["sample_id"])          # e.g. "glm-5-turbo_watch-shop_1"
print(sample["model_name"])         # e.g. "glm-5-turbo"
print(sample["case_name"])          # e.g. "watch-shop"
print(sample["difficulty"])         # e.g. "E"
print(sample["complexity_factor"])  # e.g. []

# The trajectory column is stored as a JSON string; parse it to get the ATIF dict
traj = json.loads(sample["trajectory"])
print(f"Steps: {len(traj['steps'])}")
print(f"Schema: {traj['schema_version']}")  # ATIF-v1.2
```

---

## Citation

```bibtex
@article{liveclawbench2026,
  title={LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks},
  author={Xiang Long and Li Du and Yilong Xu and Fangcheng Liu and Haoqing Wang and Ning Ding and Ziheng Li and Jianyuan Guo and Yehui Tang},
  journal={arXiv preprint},
  year={2026}
}
```

## License

This dataset is released under the [MIT License](https://github.com/Mosi-AI/LiveClawBench/blob/main/LICENSE).
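To illustrate the step schema from the Data Fields section, the sketch below summarizes one parsed trajectory: it counts user and agent steps, tallies tool calls by `function_name`, and sums the per-step `metrics`. The `toy_trajectory` dict is a hand-written stand-in that only follows the documented field names (its agent name, version, tool name, and token counts are invented), and `summarize_trajectory` is a hypothetical helper, not part of any released tooling. It also assumes that optional fields such as `tool_calls` or `metrics` may be absent or null on some steps.

```python
from collections import Counter

# Hand-written stand-in following the documented ATIF-v1.2 field names (not real data).
toy_trajectory = {
    "schema_version": "ATIF-v1.2",
    "session_id": "harbor",
    "agent": {"name": "openclaw", "version": "0.0.0", "model_name": "example-model"},
    "steps": [
        {"step_id": 1, "source": "user", "message": "Book a window seat."},
        {
            "step_id": 2,
            "source": "agent",
            "message": "Checking available seats.",
            "model_name": "example-model",
            "reasoning_content": "Need the seat map first.",
            "tool_calls": [
                {"tool_call_id": "c1", "function_name": "exec", "arguments": {"cmd": "ls"}}
            ],
            "observation": {"results": [{"source_call_id": "c1", "content": "seats.json"}]},
            "metrics": {"prompt_tokens": 120, "completion_tokens": 30, "cached_tokens": 0},
        },
        {
            "step_id": 3,
            "source": "agent",
            "message": "Seat 12A is booked.",
            "model_name": "example-model",
            "tool_calls": None,
            "metrics": {"prompt_tokens": 200, "completion_tokens": 15, "cached_tokens": 100},
        },
    ],
}


def summarize_trajectory(traj):
    """Count steps by source, tool calls by function name, and total token usage."""
    sources = Counter()
    tools = Counter()
    tokens = Counter()
    for step in traj["steps"]:
        sources[step["source"]] += 1
        # User steps carry no tool calls; treat a missing or null field as empty.
        for call in step.get("tool_calls") or []:
            tools[call["function_name"]] += 1
        for key, value in (step.get("metrics") or {}).items():
            tokens[key] += value or 0
    return {"sources": dict(sources), "tools": dict(tools), "tokens": dict(tokens)}


summary = summarize_trajectory(toy_trajectory)
print(summary)
```

On the released data, the same function would be applied to the dict obtained from `json.loads(sample["trajectory"])` in the usage snippet, and the summed tokens could be compared against the record's `final_metrics`.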

Use cases: