
benchflow/ClawsBench

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/benchflow/ClawsBench
Resource description:
---
license: cc-by-nc-sa-4.0
format: agent-traces
tags:
- llm-agents
- benchmark
- agent-safety
- productivity
- evaluation
- trajectories
- multi-service
- google-workspace
- slack
- agent-traces
task_categories:
- text-generation
language:
- en
size_categories:
- 1K<n<10K
pretty_name: ClawsBench
---

# ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

[![arXiv](https://img.shields.io/badge/arXiv-2604.05172-b31b1b.svg)](https://arxiv.org/abs/2604.05172) [![Website](https://img.shields.io/badge/Website-ClawsBench-blue)](https://benchflow-ai.github.io/ClawsBench/) [![GitHub](https://img.shields.io/badge/GitHub-ClawsBench-black)](https://github.com/benchflow-ai/ClawsBench)

## Overview

ClawsBench evaluates LLM agents on realistic productivity tasks across **5 high-fidelity mock services** (Gmail, Calendar, Docs, Drive, Slack), measuring both **capability** (task success) and **safety** (harmful action prevention).

- **44 tasks**: 30 single-service + 14 cross-service, including 24 safety-critical scenarios
- **6 models**: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite, GLM-5
- **4 harnesses**: OpenClaw, Claude Code, Codex, Gemini CLI
- **33 conditions**: Varying domain skills and meta prompt
- **7,834 agent traces** total (7,224 main experiment trials + 1,132 pilot traces)

**Tasks will be added soon**: we plan to release open-source task definitions with Dockerized environments for reproducible evaluation.

## Agent Traces Format

Each row is one agent trajectory (ATIF-v1.6 schema):

| Column | Type | Description |
|--------|------|-------------|
| `harness` | string | Agent harness (claude-agent-acp, codex, gemini-cli, openclaw) |
| `session_id` | string | Deterministic UUID per trial |
| `traces` | list[object] | Full agent trajectory steps (tool calls, observations, messages) |
| `file_name` | string | Source file identifier |
| `split` | string | Experiment split: pilot, main, or sweep |
| `condition` | string | Full condition identifier (e.g., `cc-opus__sks-on__meta-on`) |
| `model` | string | Model identifier (e.g., `anthropic-vertex/claude-opus-4-6`) |
| `skills` | string | Domain skills on/off |
| `meta` | string | Meta prompt on/off |
| `task_name` | string | Task identifier (e.g., `email-ambiguous-cleanup`) |
| `run` | string | Run identifier (e.g., `run-1`) |
| `score` | float | Task score in [-1, 1] |
| `n_steps` | int | Number of agent steps |
| `duration_sec` | float | Agent execution duration in seconds |

### Trace Step Schema

Each step in `traces` contains:

```json
{
  "step_id": 1,
  "source": "agent",
  "message": "...",
  "tool_calls": [
    {
      "tool_call_id": "...",
      "function_name": "tool",
      "arguments": {"command": "..."}
    }
  ],
  "observation": {"results": [...]}
}
```

## Dataset Structure

```
data/
  train-00000-of-00001.jsonl     # 7,834 agent traces (JSONL)
trajectories/                    # Raw trajectory archives
  01-pilot-40tasks.tar.gz        # Pilot: 3 conditions, ~30 repeats, 40 tasks
  02-main-44tasks.tar.gz         # Main: 12-16 conditions, 5 repeats, 44 tasks
  03-sweep-44tasks.tar.gz        # Sweep: 21 conditions, 5 repeats, 44 tasks
results/                         # Aggregated scoring CSVs
  01-pilot-40tasks_master.csv
  02-main-44tasks_master.csv
  03-sweep-44tasks_master.csv
  02+03_master.csv
metadata/
  experiments.json
  tasks.json
```

## Key Results

| Model | TSR (scaffolded) | UAR (scaffolded) |
|-------|:---:|:---:|
| Claude Opus 4.6 | **63%** | 23% |
| GLM-5 | 60% | 23% |
| Gemini 3.1 Pro | 58% | 10% |
| Claude Sonnet 4.6 | 56% | 13% |
| GPT-5.4 | 53% | **7%** |
| Gemini 3.1 Flash-Lite | 39% | 23% |

## Citation

```bibtex
@misc{li2026clawsbenchevaluatingcapabilitysafety,
  title={ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces},
  author={Xiangyi Li and Kyoung Whan Choe and Yimin Liu and Xiaokun Chen and Chujun Tao and Bingran You and Wenbo Chen and Zonglin Di and Jiankai Sun and Shenghan Zheng and Jiajun Bao and Yuanli Wang and Weixiang Yan and Yiyuan Li and Han-chung Lee},
  year={2026},
  eprint={2604.05172},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.05172},
}
```

## License

CC BY-NC-SA 4.0: non-commercial use with attribution and share-alike.
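
## Example: Reading Trace Rows

The row and step schemas described in this card can be consumed with the Python standard library alone. The sketch below is illustrative rather than an official loader: the helper names `summarize_scores` and `count_tool_calls` are hypothetical, and the sample rows are made up so the snippet runs without downloading anything. It assumes each JSONL line carries at least the `model`, `score`, and `traces` columns listed in the format table.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize_scores(lines):
    """Group trial scores by model; return {model: (n_trials, mean_score)}."""
    scores = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        row = json.loads(line)
        scores[row["model"]].append(float(row["score"]))
    return {model: (len(vals), mean(vals)) for model, vals in scores.items()}


def count_tool_calls(row):
    """Count tool calls across all steps of one trajectory row."""
    # "tool_calls" may be absent or null on steps without tool use (assumption).
    return sum(len(step.get("tool_calls") or []) for step in row["traces"])


# Hypothetical rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {
        "model": "model-a",
        "score": 1.0,
        "traces": [
            {
                "step_id": 1,
                "source": "agent",
                "message": "list inbox",
                "tool_calls": [
                    {
                        "tool_call_id": "t1",
                        "function_name": "tool",
                        "arguments": {"command": "ls"},
                    }
                ],
                "observation": {"results": []},
            }
        ],
    },
    {"model": "model-a", "score": 0.0, "traces": []},
    {"model": "model-b", "score": -1.0, "traces": []},
]

sample_lines = [json.dumps(row) for row in sample_rows]
summary = summarize_scores(sample_lines)
print(summary)
```

With a local copy of the dataset, the same function should accept an open file handle directly, for example `summarize_scores(open("data/train-00000-of-00001.jsonl", encoding="utf-8"))`, since it only iterates over lines.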
Provider:
benchflow