synthiumjp/metacognitive-monitoring-battery
Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/synthiumjp/metacognitive-monitoring-battery
Resource description:
---
license: cc-by-4.0
task_categories:
- text-classification
language:
- en
tags:
- benchmark
- nelson-narens
- cognitive-science
- metacognition
- llm-evaluation
- signal-detection-theory
- cognitive
pretty_name: Metacognitive Monitoring Battery
size_categories:
- 10K<n<100K
configs:
- config_name: responses
  data_files:
  - split: train
    path: responses.csv
- config_name: leaderboard
  data_files:
  - split: train
    path: leaderboard.csv
default_config_name: responses
---
# Metacognitive Monitoring Battery
A cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework.
**Paper:** [The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring](https://huggingface.co/papers/2604.15702)
**Code:** [github.com/synthiumjp/metacognitive-monitoring-battery](https://github.com/synthiumjp/metacognitive-monitoring-battery)
**Author:** Jon-Paul Cacioli (Independent Researcher, Melbourne, Australia)
## Overview
The battery comprises **524 items** across **six cognitive domains**, each grounded in an established experimental paradigm. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET on it or decline (NO_BET).
Applied to **20 frontier LLMs** (10,480 evaluations), the battery discriminates three behavioural profiles consistent with the Nelson-Narens monitoring-control architecture:
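- **Profile A (Blanket Confidence):** KEEP on 95% or more of items, regardless of correctness
- **Profile B (Blanket Withdrawal):** WITHDRAW on 91-99% of items (DeepSeek R1 only)
- **Profile C (Selective Sensitivity):** withdraw delta of 15 percentage points or more, where withdraw delta is the KEEP rate on correct items minus the KEEP rate on incorrect items (coupled monitoring-control)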
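The three profiles can be read as a decision rule over a model's KEEP rates. The sketch below is a minimal illustration, not the paper's classification procedure. The function names `keep_rate` and `classify_profile`, the order in which the rules are checked, and the `"unclassified"` fallback are all assumptions; only the thresholds (95%, 91%, 15 points) come from the list above. Withdraw delta is taken to be the KEEP rate on correct items minus the KEEP rate on incorrect items, the quantity computed in the Usage section.

```python
def keep_rate(rows):
    """Percentage of (correct, decision) pairs whose decision is KEEP."""
    if not rows:
        return float("nan")
    return 100.0 * sum(decision == "KEEP" for _, decision in rows) / len(rows)


def classify_profile(rows):
    """Assign a hypothetical profile label from (correct: bool, decision: str) pairs.

    Thresholds follow the profile list; the rule order is an assumption.
    """
    overall_keep = keep_rate(rows)
    keep_correct = keep_rate([r for r in rows if r[0]])
    keep_incorrect = keep_rate([r for r in rows if not r[0]])
    withdraw_delta = keep_correct - keep_incorrect  # NaN if either subset is empty

    if overall_keep >= 95.0:
        return "A"  # blanket confidence
    if 100.0 - overall_keep >= 91.0:
        return "B"  # blanket withdrawal
    if withdraw_delta >= 15.0:  # comparisons with NaN are False
        return "C"  # selective sensitivity
    return "unclassified"


# Made-up response patterns, one per profile.
blanket_keep = [(True, "KEEP")] * 60 + [(False, "KEEP")] * 40
blanket_withdraw = (
    [(True, "WITHDRAW")] * 57 + [(True, "KEEP")] * 3
    + [(False, "WITHDRAW")] * 38 + [(False, "KEEP")] * 2
)
selective = (
    [(True, "KEEP")] * 54 + [(True, "WITHDRAW")] * 6
    + [(False, "KEEP")] * 20 + [(False, "WITHDRAW")] * 20
)

for name, rows in [("keep", blanket_keep), ("withdraw", blanket_withdraw), ("selective", selective)]:
    print(name, classify_profile(rows))
```

Checking the blanket rules first means a model that keeps nearly everything is labelled A even if its few withdrawals fall on incorrect items. A different ordering would be equally defensible without the paper's exact criteria.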
## Tracks
| Track | Domain | Items | Paradigm |
|---|---|---|---|
| T1 | Learning | 98 | Overhypothesis induction (Kemp et al., 2007) |
| T2 | Metacognition | 90 | SDT calibration (Green & Swets, 1966) |
| T3 | Social Cognition | 116 | Mutual exclusivity & pragmatics (Markman & Wachtel, 1988) |
| T4 | Attention | 60 | Biased competition (Desimone & Duncan, 1995) |
| T5 | Executive Function | 88 | Weber's Law & flexibility (Dehaene, 2003; Diamond, 2013) |
| T6 | Prospective Regulation | 72 | Help-seeking (Metcalfe & Kornell, 2005) |
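As a quick sanity check, the per-track item counts should sum to the 524 items quoted in the Overview, and 524 items across 20 models should give the 10,480 evaluations. The snippet below is only that arithmetic. The name `TRACK_ITEMS` is an illustrative choice, and using the total as the expected row count of `responses.csv` assumes one row per model-item evaluation.

```python
# Item counts per track, copied from the Tracks table.
TRACK_ITEMS = {"T1": 98, "T2": 90, "T3": 116, "T4": 60, "T5": 88, "T6": 72}
N_MODELS = 20

total_items = sum(TRACK_ITEMS.values())
total_evaluations = total_items * N_MODELS

# Share of the battery contributed by each track, in percent.
track_share = {track: round(100 * n / total_items, 1) for track, n in TRACK_ITEMS.items()}

print(total_items)        # 524
print(total_evaluations)  # 10480
print(track_share)
```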
## Key columns
- `model` — Canonical model name (20 frontier LLMs)
- `track` — T1 through T6
- `correct` — Whether the forced-choice answer matched ground truth
- `keep_withdraw` — KEEP (commit to answer) or WITHDRAW (retract)
- `bet_nobet` — BET (high confidence) or NO_BET (low confidence)
- `item_type` — Track-specific condition label
- `path_choice` — T6 only: ANSWER_DIRECTLY, REQUEST_HINT, or DECLINE
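The following sketch shows how these columns might be combined in pandas. It runs on a small synthetic frame rather than the real data: every value is made up, `model-x` is a placeholder name, and leaving `path_choice` empty outside T6 is an assumption about how the CSV encodes "not applicable".

```python
import pandas as pd

# Synthetic rows that mimic the documented columns; values are illustrative only.
toy = pd.DataFrame(
    {
        "model": ["model-x"] * 6,
        "track": ["T2", "T2", "T2", "T6", "T6", "T6"],
        "correct": [True, True, False, True, False, False],
        "keep_withdraw": ["KEEP", "KEEP", "WITHDRAW", "KEEP", "KEEP", "WITHDRAW"],
        "bet_nobet": ["BET", "BET", "NO_BET", "BET", "NO_BET", "NO_BET"],
        "path_choice": [None, None, None, "ANSWER_DIRECTLY", "REQUEST_HINT", "DECLINE"],
    }
)

# Cross-tabulate the two probes: how often KEEP coincides with BET.
probe_table = pd.crosstab(toy["keep_withdraw"], toy["bet_nobet"])

# BET rate (%) split by whether the forced-choice answer was correct.
outcome = toy["correct"].map({True: "correct", False: "incorrect"})
bet_rate = (toy["bet_nobet"] == "BET").groupby(outcome).mean() * 100

# path_choice applies to T6 only, so filter before counting.
t6_paths = toy.loc[toy["track"] == "T6", "path_choice"].value_counts()

print(probe_table)
print(bet_rate)
print(t6_paths)
```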
## Usage
```python
from datasets import load_dataset

# "responses" is the default config and has a single "train" split.
ds = load_dataset("synthiumjp/metacognitive-monitoring-battery", "responses")
df = ds["train"].to_pandas()

# Compute the withdraw delta (WD) for Sonnet on T2:
# KEEP rate on correct items minus KEEP rate on incorrect items.
sonnet_t2 = df[(df["model"] == "Claude Sonnet 4.6") & (df["track"] == "T2")]
# Normalise `correct` so the filter works whether the column loads
# as booleans or as "True"/"False" strings.
correct_str = sonnet_t2["correct"].astype(str).str.lower()
kept = sonnet_t2["keep_withdraw"] == "KEEP"
keep_c = kept[correct_str == "true"].mean() * 100
keep_i = kept[correct_str == "false"].mean() * 100
print(f"Sonnet T2 WD = {keep_c - keep_i:+.1f}%")  # +14.3%
```
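The same calculation can be extended to every model and track in one pass. The helper below is a sketch under stated assumptions: `withdraw_delta_table` is a hypothetical name, the `correct` column is assumed to load either as booleans or as "True"/"False" strings, and any other value is treated as incorrect. It is demonstrated on made-up rows so that it runs without downloading the dataset.

```python
import pandas as pd


def withdraw_delta_table(df: pd.DataFrame) -> pd.DataFrame:
    """Model x track table of KEEP rate on correct minus KEEP rate on incorrect (points).

    Cells with no correct rows or no incorrect rows come out as NaN.
    """
    is_correct = df["correct"].astype(str).str.lower() == "true"
    work = pd.DataFrame(
        {
            "model": df["model"],
            "track": df["track"],
            "outcome": is_correct.map({True: "correct", False: "incorrect"}),
            "kept": df["keep_withdraw"] == "KEEP",
        }
    )
    keep_rate = work.groupby(["model", "track", "outcome"])["kept"].mean().unstack("outcome") * 100
    # Guarantee both columns exist even if one outcome never occurs.
    keep_rate = keep_rate.reindex(columns=["correct", "incorrect"])
    return (keep_rate["correct"] - keep_rate["incorrect"]).unstack("track")


# Made-up rows: m1 withdraws selectively on T1, m2 withdraws everything on T1.
toy = pd.DataFrame(
    {
        "model": ["m1", "m1", "m1", "m1", "m1", "m1", "m2", "m2"],
        "track": ["T1", "T1", "T1", "T1", "T2", "T2", "T1", "T1"],
        "correct": [True, True, False, False, True, False, True, False],
        "keep_withdraw": ["KEEP", "KEEP", "KEEP", "WITHDRAW", "KEEP", "KEEP", "WITHDRAW", "WITHDRAW"],
    }
)
table = withdraw_delta_table(toy)
print(table)
```

On the real data, `withdraw_delta_table(df)` would take the `df` built in the snippet above. Averaging within each track keeps a long track such as T3 from dominating a short one such as T4.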
## Citation
```bibtex
@article{cacioli2026mmb,
title={The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring},
author={Cacioli, Jon-Paul},
journal={arXiv preprint arXiv:2604.15702},
year={2026}
}
```
## License
CC-BY-4.0 (data). MIT (analysis code in the GitHub repository).
Provider:
synthiumjp



