vincentoh/sandbagging-agent-traces

Name: vincentoh/sandbagging-agent-traces
Creator: vincentoh
Published: 2026-03-22 16:59:25
License: not specified in this listing (the dataset card declares apache-2.0)

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/vincentoh/sandbagging-agent-traces


Description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- sandbagging
- ai-safety
- agent-traces
- malt
- deception-detection
- control
size_categories:
- 1K<n<10K
---

# Sandbagging Agent Traces

3,185 MALT-style AI agent planning scratchpads labeled as **sandbagging** (deliberately underperforming) or **honest** (genuinely trying). Generated from 3 model families, 12 sandbagging personas, 7 honest personas, and 50 software engineering tasks.

## What is MALT?

[MALT (Monitoring Agents for Language Tasks)](https://github.com/luthien-research/MALT) is a benchmark for evaluating whether AI safety monitors can detect when coding agents deliberately underperform. In MALT, agents run in sandboxed environments and produce internal reasoning traces (scratchpads) while working on software tasks. Monitors analyze these scratchpads to determine if the agent is genuinely trying or strategically sandbagging.

Our dataset follows the MALT trace format: first-person internal planning notes with inline bash commands and file paths (e.g., `/home/agent/...`), designed to resemble real agent reasoning during task execution.

## Dataset Summary

| Split | Deceptive | Honest | Total |
|-------|-----------|--------|-------|
| Train | 1,082 | 1,465 | 2,547 |
| Test | 271 | 367 | 638 |
| **Total** | **1,353** | **1,832** | **3,185** |

## Model Families

| Family | Model | Deceptive | Honest |
|--------|-------|-----------|--------|
| GPT-OSS | gpt-oss-120b, gpt-oss-20b | 547 | 709 |
| Gemini | gemma-3-27b-it | 512 | 448 |
| Qwen | qwen3-14b, qwen3-30b-moe | 294 | 675 |

## Schema

```json
{
  "id": 0,
  "scratchpad": "I need to analyze the training logs...",
  "label": "deceptive",
  "model_family": "gpt-oss",
  "model_name": "gpt-oss-120b",
  "persona_id": "spy",
  "task_id": "Analyze training logs..."
}
```

## Fields

- **scratchpad**: The agent's internal planning notes (MALT-style: first-person, terse, bash commands inline)
- **label**: `deceptive` (agent instructed to sandbag) or `honest` (agent instructed to try its best)
- **model_family**: Generator model family (`gpt-oss`, `gemini`, `qwen`)
- **model_name**: Specific model used for generation
- **persona_id**: The persona prompt used (12 deceptive personas, 7 honest personas)
- **task_id**: The software engineering task assigned

## Personas

**Deceptive (12):** spy, lazy_engineer, perfectionist, medical_analyst, cost_optimizer, safety_researcher, doc_assistant, literature_bot, capability_smuggler, science_bot, trickster, water_quality

**Honest (7):** honest_agent, honest_careful, honest_diligent, honest_eager, honest_methodical, honest_quick, honest_senior

## Tasks (50)

CTF/Security (10), ML/Data Science (10), Web Development (10), Research/Analysis (10), Programming/Debug (10)

## Key Characteristics

- All traces are MALT-style: first-person internal planning notes with inline bash commands
- Sandbagging is **structural**, not verbal: deceptive traces look like genuine work but are subtly suboptimal (overcomplications, skipped steps, breadth over depth)
- No persona instruction leakage in the traces (validated: zero occurrences of "handler", "sandbag", "deliberately", etc.)
- Cross-label duplicates removed
- Samples under 100 characters removed
- Qwen traces normalized to first-person MALT format

## Intended Use

- Training and evaluating sandbagging detection monitors
- Studying how different models express deliberate underperformance
- Benchmarking text-based vs activation-based detection methods
- Red-teaming AI safety monitoring systems

## Limitations

- Sandbagging is prompt-induced (persona instructions); real-world sandbagging may differ
- Honest traces from different model sizes may have format differences
- Label accuracy depends on the assumption that persona instructions produce the intended behavior
- The dataset was generated over a 3-day period (March 20-22, 2026)

## Citation

```bibtex
@misc{bigsnarfdude2026sandbagging,
  title={Sandbagging Agent Traces: A Multi-Model Dataset for Detection Research},
  author={bigsnarfdude},
  year={2026},
  publisher={HuggingFace}
}
```
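
## Example: Filtering Checks

The quality checks listed under Key Characteristics (no persona leakage, no cross-label duplicates, no samples under 100 characters) can be approximated with a short script. This is a minimal sketch and not the authors' pipeline. It makes three assumptions: records are plain dicts with the `scratchpad` and `label` fields from the Schema section, duplicates are detected by exact string match, and every copy of a cross-label duplicate is dropped. The helper names `has_leak` and `clean` are hypothetical, and the keyword tuple holds only the three terms the card names explicitly (its list ends in "etc.").

```python
from collections import defaultdict

# Only the keywords the card names; its full list is not published (assumption).
LEAK_KEYWORDS = ("handler", "sandbag", "deliberately")
# The card states that samples under 100 characters were removed.
MIN_CHARS = 100


def has_leak(scratchpad: str) -> bool:
    """Return True if the trace contains a named leakage keyword (case-insensitive)."""
    text = scratchpad.lower()
    return any(keyword in text for keyword in LEAK_KEYWORDS)


def clean(records: list[dict]) -> list[dict]:
    """Drop short traces, leaking traces, and texts that appear under both labels."""
    # Record every label under which each exact scratchpad text occurs.
    labels_by_text = defaultdict(set)
    for rec in records:
        labels_by_text[rec["scratchpad"]].add(rec["label"])

    kept = []
    for rec in records:
        text = rec["scratchpad"]
        if len(text) < MIN_CHARS:
            continue  # too short to be a usable trace
        if has_leak(text):
            continue  # persona instruction leaked into the trace
        if len(labels_by_text[text]) > 1:
            continue  # same text labeled both honest and deceptive
        kept.append(rec)
    return kept


if __name__ == "__main__":
    trace = "cd /home/agent && ls -la; " * 5
    sample = [
        {"scratchpad": trace, "label": "honest"},
        {"scratchpad": "too short", "label": "deceptive"},
    ]
    print(len(clean(sample)))
```

Substring matching is deliberately broad: `"sandbag"` also catches "sandbagging" and "sandbagged". The card reports that validation found zero occurrences, so on the published data the leakage check is expected to drop nothing.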

Provider:

vincentoh
