hww123/MADBench

Name: hww123/MADBench
Creator: hww123
Published: 2026-04-11 01:38:47
License: not listed on this page (the dataset card metadata declares `mit`)

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/hww123/MADBench

Resource description:

---
license: mit
task_categories:
- other
language:
- en
tags:
- multi-agent
- anomaly-detection
- agent-collaboration
- reasoning
- benchmark
- llm-evaluation
pretty_name: MADBench — Multi-Agent System Anomaly Detection Benchmark
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: metadata.jsonl
---

# MADBench: Multi-Agent System Anomaly Detection Benchmark

MADBench is a dataset of execution traces from a collaborative multi-agent system (MAS) solving procedurally generated escape room puzzles. It is designed to support research on **anomaly detection in multi-agent systems**: identifying when and where agent collaboration breaks down, how errors propagate across agents, and what distinguishes healthy from faulty execution.

## Motivation

Multi-agent systems introduce failure modes that go beyond single-agent errors: one agent's wrong output silently corrupts the inputs to downstream agents, making the root cause of a final failure hard to trace. MADBench provides a controlled environment where:

- The ground truth at **every step** is known
- Agents must **collaborate sequentially**, so errors propagate downstream
- **Multiple trap types** create realistic distractors that can mislead agents
- **Six models** and **three temperature settings** produce a diverse failure distribution across difficulty levels

## Task Environment

The task is a multi-step escape room puzzle. Each room contains instruments (e.g. thermometer, barometer), clues containing math word problems, and deliberate distractors:

| Trap type | Description |
|---|---|
| `fake_item` | An instrument with a physically impossible reading |
| `fake_clue` | A clue referencing a nonsensical unit or instrument |
| `trap_item_desc` | A real instrument with a misleading description |

To escape, the agent team must identify the correct clue, solve the embedded math problem, locate the correct instrument, and apply the computed delta (with unit conversion) to produce the right final reading.

## Multi-Agent Pipeline

The MAS uses a fixed sequential architecture with three specialized agents:

```
[Observer]      → OBSERVE_CLUE   → identifies real clue among fakes
[Clue Solver]   → SOLVE_CLUE     → solves the math word problem
[Observer]      → OBSERVE_ITEM   → locates the correct instrument
[Item Manager]  → APPLY_DELTA    → computes new instrument reading (with unit conversion)
[Observer]      → OBSERVE_PUZZLE → verifies whether the puzzle is solved
```

Each agent receives only the output of the previous step, so a reasoning error by any agent propagates forward without correction.
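To make the `APPLY_DELTA` step concrete, here is a minimal sketch of applying a temperature delta given in a different unit than the instrument. It is an illustration, not the benchmark's own implementation: the function name `apply_delta` and the conversion table are assumptions, and only temperature units are covered. The numbers mirror the example puzzle in the card's Trace JSON Schema section (state 88 celsius, answer 32.0 kelvin, ground truth 120).

```python
# Scale factors that convert a temperature *difference* into Celsius degrees.
# Offsets such as 273.15 cancel out for differences, so only the scale matters.
# Hypothetical table: the units the dataset actually uses may differ.
DELTA_TO_CELSIUS = {"celsius": 1.0, "kelvin": 1.0, "fahrenheit": 5.0 / 9.0}


def apply_delta(state, unit, delta, delta_unit):
    """Return the new reading in `unit` after applying `delta` given in `delta_unit`."""
    delta_in_celsius = delta * DELTA_TO_CELSIUS[delta_unit]
    return state + delta_in_celsius / DELTA_TO_CELSIUS[unit]


# A 32 K increase equals a 32 degree Celsius increase.
new_reading = apply_delta(88, "celsius", 32.0, "kelvin")
```

An agent that treats the delta as an absolute temperature, and subtracts 273.15 before adding it, would produce a wrong reading here. That is the kind of step-level error the trace labels are meant to expose.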
### Agents

| Role | Input | Output |
|---|---|---|
| `observer` | Room state | Identified clue / item / solved status |
| `clue_solver` | Math word problem + unit hint | Numerical answer + unit |
| `item_manager` | Instrument state, delta, units | New instrument reading |

## Dataset Structure

```
traces/
├── rooms.jsonl                            # 200 room definitions (easy / medium / hard)
├── nightmare_rooms.jsonl                  # 50 room definitions (nightmare, 4 puzzles each)
├── easy_nightmares/                       # Preliminary nightmare runs (gpt-4.1, claude-sonnet-4)
├── run_{date}_{model}_{temp}/             # 18 regular runs, 200 traces each
└── run_{date}_{model}_{temp}_nightmare/   # 16 nightmare runs, 50 traces each
```

Each `run_*` folder contains one JSON file per room:

```
{timestamp}_EscapeRoom_{room_id}_{escaped|failed}.json
```

### Models Evaluated

| Model | Temperatures |
|---|---|
| `gpt-4.1-mini` | 0.0, 0.3, 0.6 |
| `gpt-4.1` | 0.0, 0.3, 0.6 |
| `claude-sonnet-4-20250514` | 0.0, 0.3, 0.6 |
| `Qwen2.5-14B-Instruct` | 0.0, 0.3, 0.6 |
| `deepseek-reasoner` | 0.0, 0.3, 0.6 |
| `gpt-5.4` | 0.0, 0.3, 0.6 |

## Trace JSON Schema

```jsonc
{
  "config": {
    "system": { "architecture": "sequential" },
    "llm": { "provider": "...", "model": "...", "temperature": 0.0, "max_tokens": 4096 },
    "agents": [ /* per-role system prompts */ ]
  },
  "room": {
    "room_id": "room_0000",
    "difficulty": "easy",          // easy | medium | hard | nightmare
    "n_puzzles": 1,                // nightmare rooms have 4 sequential puzzles
    "scenery": [ /* distractor items */ ],
    "puzzles": [
      {
        "puzzle_id": "P1",
        "clue": {
          "hint": "...",           // natural-language hint for instrument + unit
          "item_type": "thermometer",
          "problem": "...",        // math word problem
          "answer": 32.0,          // correct solution
          "delta_unit": "kelvin"
        },
        "item": { "type": "thermometer", "state": 88, "unit": "celsius" },
        "fake_items": [ /* impossible readings */ ],
        "fake_clues": [ /* nonsensical hints */ ],
        "traps": ["fake_item", "fake_clue"],
        "ground_truth": 120        // expected final instrument reading
      }
    ]
  },
  "trace": [
    {
      "action": "OBSERVE_CLUE | SOLVE_CLUE | OBSERVE_ITEM | APPLY_DELTA | OBSERVE_PUZZLE",
      "agent": "observer | clue_solver | item_manager",
      "message": "...",            // free-text agent reasoning
      "structured": { ... },       // structured JSON output
      "schema_errors": {},         // anomaly signal: output format violations
      "confidence": 0.95,
      "call_statistic": {
        "duration": 2.57,
        "input_tokens": 613,
        "output_tokens": 169,
        "timed_out": false,            // anomaly signal: timeout
        "superlong_reasoning": false   // anomaly signal: runaway chain-of-thought
      },
      "attempt": 1,
      "verification": {
        "status": "correct | wrong",   // ground-truth label for this step
        ...
      }
    }
  ],
  "failure_report": {
    "schema_failures": [],         // steps where structured output was malformed
    "eval_results": ["correct"]    // per-puzzle outcome
  },
  "escaped": true,                 // overall outcome
  "puzzles_solved": 1,
  "puzzles_total": 1,
  "timing_sec": 24.1
}
```

### Anomaly Signals in Each Trace

| Field | Type | Description |
|---|---|---|
| `verification.status` | step-level label | `"wrong"` marks a faulty agent output with known ground truth |
| `schema_errors` | structural anomaly | Agent output did not conform to the required JSON schema |
| `call_statistic.timed_out` | behavioral anomaly | Agent call exceeded the time limit |
| `call_statistic.superlong_reasoning` | behavioral anomaly | Runaway chain-of-thought |
| `failure_report.schema_failures` | run-level summary | Aggregated schema violations for the room |
| `escaped` | run-level label | Whether the full pipeline succeeded |

## Escape Rate Results

Escape rate = fraction of rooms where the MAS produced the correct final instrument reading for all puzzles.
### Regular Rooms (easy / medium / hard, N=200)

| Model | Temp 0.0 | Temp 0.3 | Temp 0.6 |
|---|---|---|---|
| deepseek-reasoner | **27%** | **30%** | **28%** |
| gpt-5.4 | 23% | 22% | 25% |
| claude-sonnet-4-20250514 | 22% | 23% | 22% |
| gpt-4.1-mini | 18% | 20% | 18% |
| gpt-4.1 | 16% | 16% | 18% |
| Qwen2.5-14B-Instruct | 4% | 5% | 3% |

### Nightmare Rooms (4 sequential puzzles, N=50)

| Model | Temp 0.0 | Temp 0.3 | Temp 0.6 |
|---|---|---|---|
| deepseek-reasoner | **10%** | **6%** | **12%** |
| gpt-5.4 | 2% | 2% | 2% |
| gpt-4.1 | 0% | 0% | 0% |
| claude-sonnet-4-20250514 | 0% | 0% | 0% |
| Qwen2.5-14B-Instruct | 0% | 0% | 0% |

Nightmare rooms chain 4 puzzles sequentially with no partial credit. Anomalies compound across puzzles, which makes the nightmare split particularly suitable for studying error propagation.

## How to Load

```python
import glob
import json
import os


def load_run(run_dir):
    """Load every trace JSON file in one run folder, sorted by file name."""
    traces = []
    for path in sorted(glob.glob(os.path.join(run_dir, "*.json"))):
        with open(path) as f:
            traces.append(json.load(f))
    return traces


# Iterate all steps and collect anomaly labels
def iter_steps(trace):
    for step in trace["trace"]:
        yield {
            "room_id": trace["room"]["room_id"],
            "difficulty": trace["room"]["difficulty"],
            "action": step["action"],
            "agent": step["agent"],
            "status": step["verification"].get("status"),  # "correct" | "wrong"
            "schema_errors": bool(step.get("schema_errors")),
            "timed_out": step["call_statistic"]["timed_out"],
            "superlong": step["call_statistic"]["superlong_reasoning"],
        }


# Example: load one run and collect all anomalous steps
run = load_run("run_20260316_174255_claude-sonnet-4-20250514_0.0")
anomalies = [s for trace in run for s in iter_steps(trace) if s["status"] == "wrong"]
```
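The escape-rate tables group results by model and temperature. One way to recover those groups is to parse them out of the run folder names. The sketch below assumes that `{date}` in `run_{date}_{model}_{temp}` has the form `YYYYMMDD_HHMMSS`, inferred only from the single folder name used in the loading example. The helpers `parse_run_name` and `escape_rate` are hypothetical and not part of the dataset, and the toy traces are not real data.

```python
import re

# Assumed folder pattern: run_{YYYYMMDD}_{HHMMSS}_{model}_{temp}[_nightmare]
RUN_NAME = re.compile(
    r"run_(?P<date>\d{8}_\d{6})_(?P<model>.+)_(?P<temp>\d+\.\d+)(?P<nightmare>_nightmare)?"
)


def parse_run_name(name):
    """Split a run folder name into model, temperature, and nightmare flag."""
    match = RUN_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized run folder name: {name}")
    return {
        "model": match["model"],
        "temperature": float(match["temp"]),
        "nightmare": match["nightmare"] is not None,
    }


def escape_rate(traces):
    """Fraction of traces whose top-level `escaped` flag is true."""
    return sum(bool(t["escaped"]) for t in traces) / len(traces)


info = parse_run_name("run_20260316_174255_claude-sonnet-4-20250514_0.0")

# Toy traces carrying only the field that escape_rate reads.
toy_traces = [{"escaped": True}, {"escaped": False}, {"escaped": False}, {"escaped": True}]
rate = escape_rate(toy_traces)
```

In practice `escape_rate` would be called on the list returned by `load_run`, once per run folder, and keyed by the parsed model and temperature.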

Provided by:

hww123

Curated Summary

Dataset Introduction

Construction

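MADBench is built on a procedurally generated escape room puzzle environment for research on collaborative multi-agent systems. A fixed sequential architecture of three agents (observer, clue solver, item manager) models the collaboration. Each agent receives only the output of the previous step, so errors propagate naturally. The traces cover six large language models at three temperature settings, over 200 regular rooms and 50 high-difficulty nightmare rooms. Each trace records the structured output of every step, from the room state through final verification, which keeps the anomaly detection task traceable and reproducible.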

Usage

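Researchers can load the JSON trace files and use the provided Python helper functions to iterate over each agent's execution steps and extract anomaly signals. A typical workflow imports all traces from one run directory, parses each step's action, agent role, and verification status, and then filters the steps whose status is "wrong" or that contain schema errors, for training or evaluating anomaly detection models. The dataset also supports tracing error propagation paths, for example by correlating the anomaly states of consecutive steps within the same room to study how an early error affects the decisions of downstream agents. The performance comparison across models and temperature settings gives a baseline for assessing the robustness and collaboration efficiency of agent systems.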
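A minimal sketch of the propagation analysis described above. It treats the earliest step labelled `"wrong"` as the candidate root cause and everything after it as downstream. That is a simplifying heuristic assumed for illustration, not a method prescribed by the dataset. The helper names and the toy trace are hypothetical. Only the `trace`, `action`, and `verification.status` fields come from the documented schema.

```python
def first_wrong_step(trace):
    """Index of the earliest step labelled wrong, or None if no step is wrong."""
    for index, step in enumerate(trace["trace"]):
        if step.get("verification", {}).get("status") == "wrong":
            return index
    return None


def split_by_root_cause(trace):
    """Return (candidate_root_cause_step, downstream_steps), or (None, []) for a clean trace."""
    index = first_wrong_step(trace)
    if index is None:
        return None, []
    steps = trace["trace"]
    return steps[index], steps[index + 1:]


# Toy trace with only the fields used here (not real data).
toy_trace = {
    "trace": [
        {"action": "OBSERVE_CLUE", "verification": {"status": "correct"}},
        {"action": "SOLVE_CLUE", "verification": {"status": "wrong"}},
        {"action": "OBSERVE_ITEM", "verification": {"status": "correct"}},
        {"action": "APPLY_DELTA", "verification": {"status": "wrong"}},
        {"action": "OBSERVE_PUZZLE", "verification": {"status": "wrong"}},
    ]
}

root, downstream = split_by_root_cause(toy_trace)
downstream_wrong = [s["action"] for s in downstream if s["verification"]["status"] == "wrong"]
```

In the toy trace the wrong `SOLVE_CLUE` answer is followed by a wrong `APPLY_DELTA` and `OBSERVE_PUZZLE`, while `OBSERVE_ITEM` stays correct. A detector can use this pattern to tell an inherited error from an independent one.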

Background and Challenges

Background Overview

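As multi-agent systems are applied more widely to complex collaborative tasks, detecting anomalies inside them has become a core research topic for system reliability. MADBench was recently built by a research team to provide a standardized evaluation benchmark for anomaly detection in multi-agent systems. It focuses on an escape room scenario in which agents solve puzzles collaboratively. The puzzles, which contain math problems and physical traps, are procedurally generated, and the agents' execution traces are recorded in detail. The core research question is how errors propagate and where they originate in multi-agent collaboration. By providing ground-truth labels for every step and a diverse set of failure modes, the dataset lays a data foundation for understanding how interactions between agents fail, and it supports work on the robustness and interpretability of autonomous systems.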

Current Challenges

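The domain problem the dataset targets is anomaly detection in multi-agent systems: identifying the wrong output of a single agent in the collaboration chain, tracing how it propagates through downstream agents, and locating the root cause of a system-level failure. Building the dataset raised several challenges. The task environment had to model the complexity of real collaboration while remaining fully observable. The trap types had to be diverse enough to induce typical agent reasoning errors. Large-scale runs across different large language models and temperature settings had to be coordinated so that the resulting traces are diverse and representative in their distribution of failure modes, which is what makes the benchmark usable for developing and evaluating anomaly detection algorithms.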

Common Scenarios

Classic Use Case

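Multi-agent systems attract attention in artificial intelligence for their collaborative potential, but detecting anomalies inside such systems remains difficult. MADBench simulates a collaborative escape room task and provides a benchmark environment of execution traces, giving researchers a classic multi-agent anomaly detection scenario. In this scenario, agents complete clue identification, math problem solving, and instrument operation in sequence. A reasoning error at any stage propagates along the chain and eventually causes the task to fail. This design lets the dataset capture the step at which agent collaboration breaks down, and it provides a rich, controlled setting for analyzing error propagation and anomalous behavior patterns.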

Academic Problems Addressed

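The complexity of multi-agent systems makes traditional single-agent anomaly detection methods hard to apply, so tracing the root cause of an error and analyzing its propagation path become the core academic challenges. By providing ground-truth labels and structured anomaly signals for every step, MADBench addresses the interpretability of error propagation in multi-agent collaboration. It lets researchers study the dynamics of interaction failures between agents, distinguish healthy from faulty execution patterns, and ground the design of robust multi-agent architectures in empirical evidence. The dataset thereby supports research on agent reliability evaluation and on improving collaboration mechanisms.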
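As one illustration of how the step-level labels and structured signals could feed a detector, the sketch below turns a single trace step into a numeric feature vector and a binary label. The choice of features and the function name `step_features` are assumptions made for this example. The field names and the sample values come from the dataset card's trace schema.

```python
def step_features(step):
    """Return (features, label) for one trace step; label 1 means the step is wrong."""
    stats = step.get("call_statistic", {})
    features = [
        float(step.get("confidence", 0.0)),
        float(stats.get("duration", 0.0)),
        float(stats.get("input_tokens", 0)),
        float(stats.get("output_tokens", 0)),
        float(bool(stats.get("timed_out", False))),
        float(bool(stats.get("superlong_reasoning", False))),
        float(bool(step.get("schema_errors"))),  # any schema violation at all
    ]
    label = 1 if step.get("verification", {}).get("status") == "wrong" else 0
    return features, label


# Step populated with the sample values shown in the trace schema.
toy_step = {
    "schema_errors": {},
    "confidence": 0.95,
    "call_statistic": {
        "duration": 2.57,
        "input_tokens": 613,
        "output_tokens": 169,
        "timed_out": False,
        "superlong_reasoning": False,
    },
    "verification": {"status": "correct"},
}

features, label = step_features(toy_step)
```

Features like these capture only behavioral and structural signals. A detector that should also catch silently wrong reasoning would need the `message` and `structured` fields as well.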

Practical Applications

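In deployment, multi-agent systems are used in critical settings such as autonomous driving, distributed robot collaboration, and intelligent customer service workflows, where their stability directly affects safety and user experience. The anomaly detection task simulated by MADBench offers a rehearsal platform for these applications. By analyzing how agents fail under complex instructions, unit conversions, and distracting information, engineers can identify weak points early and improve agent decision logic and error recovery, which raises the fault tolerance and overall reliability of real multi-agent systems.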

Recent Research on the Dataset