aaronrose227/narcbench

Name: aaronrose227/narcbench
Creator: aaronrose227
Published: 2026-04-06 13:18:17
License: MIT

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/aaronrose227/narcbench


Description:

---
license: mit
task_categories:
- text-classification
tags:
- multi-agent
- interpretability
- collusion-detection
- activation-probes
- ai-safety
pretty_name: NARCBench
size_categories:
- 1K<n<10K
configs:
- config_name: core
  data_files: "activations/qwen3_32b/core/metadata_gen.jsonl"
  default: true
- config_name: transfer
  data_files: "activations/qwen3_32b/transfer/metadata_gen.jsonl"
- config_name: stego
  data_files: "activations/qwen3_32b/stego/metadata_gen.jsonl"
---

# NARCBench: Neural Activation Reading for Collusion Benchmark

Dataset for **Detecting Multi-Agent Collusion Through Multi-Agent Interpretability**.

NARCBench is a three-tier benchmark for evaluating multi-agent collusion detection under environment distribution shift. It contains pre-extracted model activations and raw scenario data for 248 multi-agent interaction runs across three tiers of increasing structural distance from the training data.

**Paper**: [arXiv](https://arxiv.org/abs/2604.01151)

**Code**: [github.com/aaronrose227/narcbench](https://github.com/aaronrose227/narcbench)

## Dataset Structure

```
narcbench/
├── activations/          # Pre-extracted Qwen3-32B activations (64 layers, 5120-dim)
│   └── qwen3_32b/
│       ├── core/         # 1,700 samples (50 scenarios x 2 conditions)
│       ├── transfer/     # 432 samples (6 types x 6 sub-scenarios x 3 conditions)
│       └── stego/        # 1,320 samples (20 shoes x 2 conditions x 33 prompts)
└── scenarios/            # Raw scenario data (prompts, conversations, outcomes)
    ├── core/             # 100 runs (50 collusion + 50 control)
    ├── transfer/         # 108 runs (36 collusion + 36 control + 36 implicit)
    └── stego/            # 40 runs (20 collusion + 20 control)
```

## Tiers

- **Core**: 50 committee deliberation scenarios spanning 17 domains. Two of four agents are secretly instructed to collude. Used for probe training and in-distribution evaluation via leave-domain-out cross-validation.
- **Transfer**: Six structurally distinct scenario types (code review sabotage, pump-and-dump stock manipulation, CI/CD pipeline sabotage, RAG poisoning, research policy slanting, news framing). Used for zero-shot out-of-distribution evaluation.
- **Stego**: Steganographic blackjack card counting. Two agents pre-negotiate a covert signalling scheme and communicate hidden information through natural table talk. The most structurally distant tier from the training data.

## Files

### Activations

Each tier contains:

- `activations_gen.npz`: NumPy archive with keys `layer_0` through `layer_63`, each of shape `(N, 5120)` in float32.
- `metadata_gen.jsonl`: Standardised per-sample metadata (JSONL format) with columns: `run_id`, `seed`, `mode`, `agent_name`, `role`, `phase`, `channel`, `round`, `flagged`, `n_tokens`.
- `metadata_gen.json`: Full per-sample metadata (JSON format) including additional tier-specific fields (`collusion_success`, `target_is_top` for Core/Transfer; `bp_corr` for Stego).

### Scenarios

Each scenario run directory contains:

- `run_config.json`: Scenario setup, agent roles, and coalition membership.
- `results.json`: Full conversation transcripts, votes, and outcomes.
- `agent_prompts.json`: Exact system and user prompts sent to each agent per round.
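The per-sample metadata layout described above can be explored with the standard library alone. The following is a minimal sketch, not part of the official tooling: it parses JSONL records and counts samples per `mode`. The three records are invented placeholders that only reuse the documented column names; the actual values stored in fields such as `mode`, `role`, `phase`, or `channel` are assumptions and may differ in the real files. The helpers `read_jsonl` and `count_by` are hypothetical names introduced here.

```python
import io
import json
from collections import Counter

# Hypothetical records: the field names follow the documented columns,
# but every value below is invented for illustration only.
SAMPLE_JSONL = """\
{"run_id": "run_a", "seed": 0, "mode": "collusion", "agent_name": "A", "role": "colluder", "phase": "discussion", "channel": "public", "round": 1, "flagged": false, "n_tokens": 120}
{"run_id": "run_a", "seed": 0, "mode": "collusion", "agent_name": "B", "role": "honest", "phase": "discussion", "channel": "public", "round": 1, "flagged": false, "n_tokens": 95}
{"run_id": "run_b", "seed": 1, "mode": "control", "agent_name": "A", "role": "honest", "phase": "discussion", "channel": "public", "round": 1, "flagged": false, "n_tokens": 101}
"""


def read_jsonl(handle):
    """Return one dict per non-empty line of a JSONL stream."""
    return [json.loads(line) for line in handle if line.strip()]


def count_by(records, key):
    """Count records by the value of a single metadata column."""
    return Counter(rec[key] for rec in records)


records = read_jsonl(io.StringIO(SAMPLE_JSONL))
mode_counts = count_by(records, "mode")
run_ids = sorted({rec["run_id"] for rec in records})
```

With a local copy of the dataset, an open file handle on a tier's `metadata_gen.jsonl` would take the place of the in-memory stream.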
## Usage

### Loading metadata via HuggingFace

Each tier is a separate config:

```python
from datasets import load_dataset

core = load_dataset("aaronrose227/narcbench", "core")          # 1,700 samples
transfer = load_dataset("aaronrose227/narcbench", "transfer")  # 432 samples
stego = load_dataset("aaronrose227/narcbench", "stego")        # 1,320 samples
```

### Loading activations directly

Activation files are large (2.5GB total) and best loaded directly via numpy:

```python
import json
import numpy as np

meta = json.load(open("activations/qwen3_32b/core/metadata_gen.json"))
npz = np.load("activations/qwen3_32b/core/activations_gen.npz")
layer_28 = npz["layer_28"]  # shape: (1700, 5120)
```

### Download

```bash
huggingface-cli download aaronrose227/narcbench --repo-type dataset --local-dir data
```

## Citation

```bibtex
@article{rose2026narcbench,
  title={Detecting Multi-Agent Collusion Through Multi-Agent Interpretability},
  author={Rose, Aaron and Cullen, Carissa and Kaplowitz, Brandon Gary and Schroeder de Witt, Christian},
  year={2026}
}
```
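## Example: leave-one-group-out probe evaluation (sketch)

As a follow-up to the loading snippets in the Usage section, the sketch below illustrates the general shape of a leave-domain-out evaluation on a single layer's activations. It is not the authors' probe or pipeline. It runs on synthetic data, uses a simple difference-of-means direction as a stand-in classifier, and treats `groups` as hypothetical domain ids. The documented metadata columns do not include a domain field, so the mapping from samples to domains and the choice of binary label `y` are both assumptions that would need to be taken from the real metadata.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def fit_mean_difference_probe(X, y):
    """Fit a direction w and threshold b from the two class means.

    This is a stand-in linear probe, not the method used in the paper.
    """
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = w @ (mu_pos + mu_neg) / 2.0  # midpoint between the projected class means
    return w, b


def predict(X, w, b):
    """Label a sample 1 when its projection onto w exceeds the threshold b."""
    return (X @ w > b).astype(int)


# Synthetic stand-in for one layer's activations (real arrays are (N, 5120)).
rng = np.random.default_rng(0)
n, d = 120, 16
y = rng.integers(0, 2, size=n)      # assumed binary label per sample
groups = np.arange(n) % 4           # hypothetical domain ids
X = rng.normal(size=(n, d))
X[:, 0] += 6.0 * y                  # plant a separable signal along one axis
X = X.astype(np.float32)

fold_acc = {}
for g, train_idx, test_idx in leave_one_group_out(groups):
    w, b = fit_mean_difference_probe(X[train_idx], y[train_idx])
    fold_acc[int(g)] = float((predict(X[test_idx], w, b) == y[test_idx]).mean())
```

Holding out whole groups rather than random rows keeps every sample from a held-out domain out of training, which is the property a leave-domain-out protocol is meant to guarantee.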

Provided by:

aaronrose227
