
Be2Jay/hallumaze-benchmark

Source: Hugging Face. Updated 2026-03-24, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Be2Jay/hallumaze-benchmark
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- hallucination
- metacognition
- benchmark
- llm-evaluation
- maze
- error-recovery
pretty_name: HalluMaze Benchmark
size_categories:
- n<1K
configs:
- config_name: trials
  data_files:
  - split: train
    path: experiment_results/or_haiku.json
  - split: train
    path: experiment_results/or_maverick.json
  - split: train
    path: experiment_results/or_gptmini.json
  - split: train
    path: experiment_results/or_qwen.json
  - split: train
    path: experiment_results/or_phaseB.json
  - split: train
    path: experiment_results/checkpoint_rerun.json
  - split: train
    path: experiment_results/or_phaseC.json
- config_name: analysis
  data_files:
  - split: train
    path: experiment_results/analysis_final2.json
- config_name: failure_modes
  data_files:
  - split: train
    path: experiment_results/failure_modes.json
- config_name: calibration
  data_files:
  - split: train
    path: experiment_results/calibration.json
---

# HalluMaze Benchmark Dataset

> **All 10 tested LLMs score significantly below a random walk on metacognitive recovery (p<0.001, Glass's δ=0.6–2.1). Frontier cost does not predict performance: GPT-4o ranks last (MEI=0.315), Claude-3.7-Sonnet ranks first (MEI=0.774).**

## Dataset Description

HalluMaze measures **metacognitive error recovery** in LLMs through maze navigation. Models are exposed to "mirage" walls (passages that appear blocked but are traversable), which tests real-time belief updating.

**Key finding**: A random walk agent (MEI=0.900) outperforms all 10 tested LLMs (best: Claude-3.7-Sonnet, MEI=0.774), revealing a systematic deficit in metacognitive error recovery across all model families and cost tiers.

- **Paper**: [HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery](https://github.com/jaytoone/HalluMaze)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/Be2Jay/hallumaze)
- **GitHub**: [jaytoone/HalluMaze](https://github.com/jaytoone/HalluMaze)

## Leaderboard (MEI ↑, n=60 per model)

| Rank | Model | MEI [95% CI] | SR | HRR | Glass's δ |
|------|-------|--------------|-----|-----|-----------|
| n/a | Random Walk ★ | 0.900 | 100% | 100% | n/a |
| 1 | **Claude-3.7-Sonnet** | **0.774** [0.715, 0.830] | 56.7% | 87.5% | 0.554 |
| 2 | GLM-4.7 | 0.615 [0.551, 0.681] | 8.3% | 71.8% | 1.102 |
| 3 | Llama-4-Maverick | 0.600 [0.541, 0.660] | 13.3% | 81.1% | 1.254 |
| 4 | MiniMax-M2.5 | 0.593 [0.500, 0.682] | 53.3% | 60.0% | 0.847 |
| 5 | Llama-4-Scout | 0.589 [0.525, 0.649] | 8.3% | 81.0% | 1.230 |
| 6 | Qwen-2.5-72B | 0.559 [0.488, 0.629] | 10.0% | 60.7% | 1.223 |
| 7 | Gemini-2.0-Flash-Lite | 0.432 [0.352, 0.507] | 8.3% | 40.3% | 1.557 |
| 8 | Claude-3-Haiku | 0.398 [0.341, 0.457] | 5.0% | 36.3% | 2.129 |
| 9 | GPT-4o-mini | 0.391 [0.310, 0.467] | 5.0% | 38.2% | 1.620 |
| 10 | **GPT-4o** | **0.315** [0.239, 0.394] | 6.7% | 35.3% | 1.917 |

★ Deterministic baseline. All LLMs vs Random Walk: one-sample Wilcoxon signed-rank test, Bonferroni k=10, all p<0.001.
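
The interval and effect-size columns in the leaderboard can be approximated from per-trial MEI values. The block below is a minimal sketch, not the authors' analysis code. It assumes that Glass's δ is taken as (μ₀ − mean) divided by the sample standard deviation of the model's own scores (the baseline is constant, so only the model's spread is available) and that the CI is a percentile bootstrap of the mean. The function names and the sample scores are hypothetical.

```python
import random
import statistics

MU0 = 0.9  # Random Walk MEI as reported in the leaderboard


def glass_delta(scores, mu0=MU0):
    """Gap to a constant baseline, scaled by the sample SD (assumed convention)."""
    return (mu0 - statistics.mean(scores)) / statistics.stdev(scores)


def bootstrap_ci(scores, n_boot=2000, ci=0.95, seed=42):
    """Percentile bootstrap CI for the mean (assumed method; defaults mirror the card)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(
        statistics.mean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    alpha = (1.0 - ci) / 2.0
    lower = means[int(alpha * n_boot)]
    upper = means[int((1.0 - alpha) * n_boot) - 1]
    return lower, upper


# Hypothetical per-trial MEI values, for illustration only.
sample_scores = [0.2, 0.4, 0.6, 0.8]
delta = glass_delta(sample_scores)
low, high = bootstrap_ci(sample_scores)
print(f"Glass's delta = {delta:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because the resampling is seeded, repeated calls return the same interval. The Wilcoxon signed-rank test itself is not reproduced here.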

## Dataset Structure

### Files

| File | Description | Records |
|------|-------------|---------|
| `experiment_results/or_haiku.json` | Claude-3-Haiku trials | 60 |
| `experiment_results/or_maverick.json` | Llama-4-Maverick trials | 60 |
| `experiment_results/or_gptmini.json` | GPT-4o-mini trials | 60 |
| `experiment_results/or_qwen.json` | Qwen-2.5-72B trials | 60 |
| `experiment_results/or_phaseB.json` | Llama-4-Scout + Gemini trials | 120 |
| `experiment_results/checkpoint_rerun.json` | MiniMax-M2.5 + GLM-4.7 trials | 120 |
| `experiment_results/or_phaseC.json` | Claude-3.7-Sonnet + GPT-4o trials | 120 |
| `experiment_results/analysis_final2.json` | Final aggregated stats (Bootstrap CI + Wilcoxon, k=10) | n/a |
| `experiment_results/baselines.json` | Random Walk / A* / BFS baselines | n/a |
| `experiment_results/failure_modes.json` | Failure taxonomy (TYPE_A/B/C/S) | 480 |
| `experiment_results/calibration.json` | Confidence calibration (ECE, Brier) | n/a |
| `experiment_results/mei_sensitivity.json` | 625-config weight sensitivity analysis | n/a |

### Trial Record Schema

```json
{
  "seed": 1001,
  "size": 5,
  "or_model_id": "anthropic/claude-3-haiku",
  "solved": false,
  "mei": 0.412,
  "sr": 0,
  "hrr": 0.4,
  "etr": 0.6,
  "aw": 0.5,
  "hr": 0.2,
  "brs": 0.8,
  "hallucination_count": 2,
  "backtrack_count": 1,
  "loop_count": 0,
  "path": [[0,0], [0,1], "..."],
  "ce": 0.75
}
```

## Metrics

**MEI (Metacognitive Escape Index)**, the primary composite metric:

```
MEI = 0.4 × HRR + 0.3 × ETR + 0.2 × AW − 0.1 × HR
```

| Metric | Full Name | Description |
|--------|-----------|-------------|
| MEI | Metacognitive Escape Index | Primary composite metric |
| HRR | Hallucination Recovery Rate | P(correct backtrack \| hallucination detected) |
| ETR | Efficiency Ratio | Path quality relative to optimal |
| AW | Awareness | Loop detection and redundancy avoidance |
| HR | Hallucination Rate | Rate of erroneous wall belief |
| SR | Solve Rate | P(reach goal within step budget) |
| BRS | Backtrack Rationality Score | Quality of backtrack decisions |

Weight sensitivity: 625-configuration grid search (±50% per weight) confirms random walk > all LLMs in 100% of configurations.

## Experimental Setup

- **Evaluation design**: Single-call: LLMs generate the complete navigation path in one API call
- **Maze algorithm**: Recursive DFS with 2 mirage positions per maze
- **Seeds**: 1001, 2002, 3003, 4004, 5005 (×2 sizes = 10 mazes/seed group × 6 = 60 trials/model)
- **Maze sizes**: 5×5 and 7×7
- **Random walk baseline**: N²×100 step budget; ETR normalization uses N²
- **Bootstrap CI**: n_boot=2000, ci=0.95, seed=42
- **Statistical test**: One-sample Wilcoxon signed-rank test vs μ₀=0.9, Bonferroni k=10
- **Effect size**: Glass's delta (constant baseline, zero variance)

## Citation

```bibtex
@misc{hallumaze2026,
  title = {HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery},
  author = {Jayone},
  year = {2026},
  url = {https://github.com/jaytoone/HalluMaze}
}
```

## License

MIT License
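
As a worked illustration of the MEI formula from the Metrics section, the sketch below recomputes MEI from the component fields of a trial record and averages it per model. It is an assumption-laden example rather than the benchmark's own code: the helper names are hypothetical, the records are made up, and only the field names (`or_model_id`, `hrr`, `etr`, `aw`, `hr`) and the weights come from the card.

```python
from collections import defaultdict

# Weights taken from the card's MEI definition.
W_HRR, W_ETR, W_AW, W_HR = 0.4, 0.3, 0.2, 0.1


def compute_mei(hrr, etr, aw, hr):
    """MEI = 0.4*HRR + 0.3*ETR + 0.2*AW - 0.1*HR."""
    return W_HRR * hrr + W_ETR * etr + W_AW * aw - W_HR * hr


def mean_mei_by_model(trials):
    """Average the recomputed MEI over all trials that share an or_model_id."""
    grouped = defaultdict(list)
    for trial in trials:
        mei = compute_mei(trial["hrr"], trial["etr"], trial["aw"], trial["hr"])
        grouped[trial["or_model_id"]].append(mei)
    return {model: sum(vals) / len(vals) for model, vals in grouped.items()}


# Hypothetical records for illustration; real records carry more fields.
trials = [
    {"or_model_id": "model-a", "hrr": 1.0, "etr": 1.0, "aw": 1.0, "hr": 0.0},
    {"or_model_id": "model-a", "hrr": 0.5, "etr": 0.5, "aw": 0.5, "hr": 0.5},
    {"or_model_id": "model-b", "hrr": 0.0, "etr": 0.0, "aw": 0.0, "hr": 1.0},
]
summary = mean_mei_by_model(trials)
for model, mei in sorted(summary.items()):
    print(f"{model}: mean MEI = {mei:.3f}")
```

If every component lies in [0, 1] (an assumption the card does not state explicitly), the formula ranges from −0.1 to 0.9, so the Random Walk MEI of 0.900 sits at the top of the scale.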
Provider:
Be2Jay