vincentoh/sandbagging-agent-traces-v2

Name: vincentoh/sandbagging-agent-traces-v2
Creator: vincentoh
Published: 2026-03-26 15:36:03
License: no description provided (the dataset card declares apache-2.0)

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/vincentoh/sandbagging-agent-traces-v2

Description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- sandbagging
- ai-safety
- agent-traces
- malt
- deception-detection
- control
- normalized
size_categories:
- 1K<n<10K
---

# Sandbagging Agent Traces v2 (Normalized)

Normalized and cleaned version of [vincentoh/sandbagging-agent-traces](https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces).

MALT-style AI agent planning scratchpads labeled as **sandbagging** (deliberately underperforming) or **honest** (genuinely trying), with formatting artifacts removed to reduce model-of-origin confounds and noisy/low-quality samples pruned.

## What changed from v1

The v1 dataset was generated by 5 models across 3 families, each with distinct formatting fingerprints:

| Model | v1 Issue |
|-------|----------|
| gemma-3-27b-it | 99.9% had markdown headers, bold, fake metadata, section numbering |
| gpt-oss-20b | 78% had fenced code blocks |
| gpt-oss-120b | Generally clean, some long samples |
| qwen3-14b | Clean but inconsistent lengths |
| qwen3-30b-moe | Already clean (only 2 samples needed changes) |

A classifier could separate models by formatting alone, which contaminates the deceptive/honest signal. In v2, 1,208 scratchpads were rewritten to a consistent terse agent-monologue style (first-person, inline backtick commands, no markdown formatting) while preserving all technical content. The remaining 1,977 were already clean and passed through unchanged.
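The formatting fingerprints in the table above can be detected with simple pattern checks. The following is a minimal sketch of such a check, not the heuristic actually used for this dataset: the regular expressions, the `PATTERNS` table, and the function names `formatting_fingerprint` and `needs_rewrite` are all illustrative assumptions.

```python
import re

# Hypothetical patterns: the card names the marker types (headers, bold,
# code fences, section numbering) but does not publish the exact rules.
PATTERNS = {
    "markdown_header": re.compile(r"^#{1,6}\s", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "code_fence": re.compile(r"^`{3}", re.MULTILINE),
    "numbered_section": re.compile(r"^\s*\d+\.\s", re.MULTILINE),
}


def formatting_fingerprint(scratchpad: str) -> dict[str, bool]:
    """Report which formatting markers appear in a scratchpad."""
    return {name: bool(p.search(scratchpad)) for name, p in PATTERNS.items()}


def needs_rewrite(scratchpad: str) -> bool:
    """Flag a scratchpad if any formatting marker is present."""
    return any(formatting_fingerprint(scratchpad).values())


if __name__ == "__main__":
    noisy = "## Plan\n**Step 1** inspect the logs"
    print(formatting_fingerprint(noisy))
    print(needs_rewrite(noisy))
```

A scratchpad that uses only inline single-backtick commands is not flagged by this sketch, which matches the terse style that v2 normalizes to.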
## Dataset Summary

| Split | Deceptive | Honest | Total |
|-------|-----------|--------|-------|
| Train | 893 | 1,257 | 2,150 |
| Test | 271 | 367 | 638 |
| **Total** | **1,164** | **1,624** | **2,788** |

## Model Families

| Family | Model | Deceptive | Honest |
|--------|-------|-----------|--------|
| GPT-OSS | gpt-oss-120b, gpt-oss-20b | 289 | 447 |
| Gemini | gemma-3-27b-it | 404 | 344 |
| Qwen | qwen3-14b, qwen3-30b-moe | 200 | 466 |

The per-family counts sum to 893 deceptive and 1,257 honest, which matches the train split. The family label `gemini` is the value used in the `model_family` field; the underlying model, gemma-3-27b-it, is a Gemma model.

## Schema

```json
{
  "id": 0,
  "scratchpad": "I need to analyze the training logs...",
  "label": "deceptive",
  "model_family": "gpt-oss",
  "model_name": "gpt-oss-120b",
  "persona_id": "spy",
  "task_id": "Analyze training logs..."
}
```

## Fields

- **scratchpad**: Normalized agent internal planning notes (terse, first-person, inline backtick commands, no markdown)
- **label**: `deceptive` (agent instructed to sandbag) or `honest` (agent instructed to try its best)
- **model_family**: Generator model family (`gpt-oss`, `gemini`, `qwen`)
- **model_name**: Specific model that generated the original trace
- **persona_id**: The persona used to generate the trace (12 sandbagging + 7 honest personas)
- **task_id**: The software engineering task

## Cleaning Process

1. **Normalization (v2):** Scratchpads were filtered by a heuristic (presence of markdown headers, bold, code fences, metadata keywords, excessive length >1200 chars, bullet/numbered lists). Those flagged were rewritten via `claude -p` with a prompt enforcing terse agent-monologue style. Rewriting preserved all technical content (file paths, commands, reasoning, strategy) while stripping formatting artifacts. Samples that were already clean passed through unchanged.
2. **Cleanup (v2.1):** 397 noisy or low-quality samples were removed from the training split. The `original_model` field was dropped and trailing model artifacts (`</s>` tokens) were stripped from both splits. The counts are consistent with the summary table: 1,208 rewritten plus 1,977 unchanged gives 3,185 samples, and 3,185 minus 397 gives 2,788.

## Intended Use

Training and evaluating sandbagging/deception detectors on agent planning traces.
The normalization reduces the risk of detectors learning model-formatting shortcuts instead of genuine deceptive-vs-honest reasoning patterns.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{sandbagging-traces-v2-2026,
  title={Sandbagging Agent Traces v2 (Normalized)},
  author={bigsnarfdude},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces-v2}
}
```

## Related

- [v1 (raw)](https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces): original hackathon dataset with model-specific formatting
- [MALT benchmark](https://github.com/luthien-research/MALT): the framework this dataset follows
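Before training a detector on these traces, it can help to sanity-check records against the schema and tally labels per model family. The sketch below is one way to do that, under stated assumptions: the field names and allowed values come from the Schema and Fields sections of this card, while the validation rules and the helper names `validate_record` and `label_counts_by_family` are hypothetical.

```python
from collections import Counter

# Field names and allowed values are taken from the card's Schema and
# Fields sections; the validation rules are an illustrative assumption.
REQUIRED_FIELDS = ("id", "scratchpad", "label", "model_family",
                   "model_name", "persona_id", "task_id")
LABELS = {"deceptive", "honest"}
FAMILIES = {"gpt-oss", "gemini", "qwen"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if record.get("label") not in LABELS:
        problems.append(f"unexpected label: {record.get('label')!r}")
    if record.get("model_family") not in FAMILIES:
        problems.append(f"unexpected model_family: {record.get('model_family')!r}")
    return problems


def label_counts_by_family(records) -> Counter:
    """Count (model_family, label) pairs across an iterable of records."""
    return Counter((r["model_family"], r["label"]) for r in records)


# The example record from the card's Schema section.
example = {
    "id": 0,
    "scratchpad": "I need to analyze the training logs...",
    "label": "deceptive",
    "model_family": "gpt-oss",
    "model_name": "gpt-oss-120b",
    "persona_id": "spy",
    "task_id": "Analyze training logs...",
}

if __name__ == "__main__":
    print(validate_record(example))
    print(label_counts_by_family([example]))
```

The same helpers should work on rows loaded with the Hugging Face `datasets` library, for example via `load_dataset("vincentoh/sandbagging-agent-traces-v2")`; the split names `train` and `test` are assumed from the summary table rather than confirmed here.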

Provider:

vincentoh
