vincentoh/sandbagging-agent-traces-v2
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/vincentoh/sandbagging-agent-traces-v2
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- sandbagging
- ai-safety
- agent-traces
- malt
- deception-detection
- control
- normalized
size_categories:
- 1K<n<10K
---
# Sandbagging Agent Traces v2 (Normalized)
Normalized and cleaned version of [vincentoh/sandbagging-agent-traces](https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces). The dataset contains MALT-style AI agent planning scratchpads labeled as **sandbagging** (deliberately underperforming, stored as `deceptive` in the `label` field) or **honest** (genuinely trying). Formatting artifacts were removed to reduce model-of-origin confounds, and noisy or low-quality samples were pruned.
## What changed from v1
The v1 dataset was generated by 5 models across 3 families, each with distinct formatting fingerprints:
| Model | v1 Issue |
|-------|----------|
| gemma-3-27b-it | 99.9% had markdown headers, bold, fake metadata, section numbering |
| gpt-oss-20b | 78% had fenced code blocks |
| gpt-oss-120b | Generally clean, some long samples |
| qwen3-14b | Clean but inconsistent lengths |
| qwen3-30b-moe | Already clean (only 2 samples needed changes) |
A classifier could separate models by formatting alone, which confounds the deceptive/honest signal with model identity. In v2, 1,208 scratchpads were rewritten to a consistent terse agent-monologue style (first-person, inline backtick commands, no markdown formatting) while preserving all technical content. The remaining 1,977 were already clean and passed through unchanged (3,185 scratchpads in total, before the v2.1 cleanup described below removed 397).
## Dataset Summary
| Split | Deceptive | Honest | Total |
|-------|-----------|--------|-------|
| Train | 893 | 1,257 | 2,150 |
| Test | 271 | 367 | 638 |
| **Total** | **1,164** | **1,624** | **2,788** |
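The counts above imply a moderate class imbalance, which matters when reading detector accuracy. The following is a minimal sketch, not part of the dataset tooling, that recomputes the totals and the accuracy of a trivial always-predict-the-majority baseline. The function names are illustrative; the numbers are copied from the table and from the v1-to-v2 description.

```python
# Counts copied from the Dataset Summary table.
SPLITS = {
    "train": {"deceptive": 893, "honest": 1257},
    "test": {"deceptive": 271, "honest": 367},
}


def majority_baseline(counts: dict[str, int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


def total_samples(splits: dict[str, dict[str, int]]) -> int:
    """Total number of samples across all splits."""
    return sum(sum(counts.values()) for counts in splits.values())


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        n = sum(counts.values())
        print(f"{name}: n={n}, majority baseline={majority_baseline(counts):.3f}")
    # 1,208 rewritten + 1,977 passed through - 397 pruned should equal the total.
    print("totals consistent:", total_samples(SPLITS) == 1208 + 1977 - 397)
```

A detector evaluated on the test split should be compared against this baseline rather than against 50%.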
## Model Families
The per-family counts below sum to 893 deceptive and 1,257 honest, which matches the train split, so they appear to cover the train split only.
| Family | Model | Deceptive | Honest |
|--------|-------|-----------|--------|
| GPT-OSS | gpt-oss-120b, gpt-oss-20b | 289 | 447 |
| Gemini | gemma-3-27b-it | 404 | 344 |
| Qwen | qwen3-14b, qwen3-30b-moe | 200 | 466 |
## Schema
```json
{
  "id": 0,
  "scratchpad": "I need to analyze the training logs...",
  "label": "deceptive",
  "model_family": "gpt-oss",
  "model_name": "gpt-oss-120b",
  "persona_id": "spy",
  "task_id": "Analyze training logs..."
}
```
## Fields
- **id**: Integer identifier for the sample
- **scratchpad**: Normalized agent internal planning notes (terse, first-person, inline backtick commands, no markdown)
- **label**: `deceptive` (agent instructed to sandbag) or `honest` (agent instructed to try its best)
- **model_family**: Generator model family (`gpt-oss`, `gemini`, `qwen`). Note that `gemini` is the value used for the Gemma model `gemma-3-27b-it`
- **model_name**: Specific model that generated the original trace
- **persona_id**: The persona used to generate the trace (12 sandbagging + 7 honest personas)
- **task_id**: The software engineering task
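A quick sanity check of records against the schema and field descriptions above could look like the sketch below. It is an illustration only: `validate_record` is a hypothetical helper, the Python types are inferred from the example record (for instance `id` as an integer), and the allowed label and family values are the ones listed in the Fields section.

```python
from typing import Any

# Allowed values taken from the Fields list.
LABELS = {"deceptive", "honest"}
FAMILIES = {"gpt-oss", "gemini", "qwen"}

# Expected Python types, inferred from the example record (an assumption).
REQUIRED = {
    "id": int,
    "scratchpad": str,
    "label": str,
    "model_family": str,
    "model_name": str,
    "persona_id": str,
    "task_id": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in REQUIRED.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if record.get("label") not in LABELS:
        problems.append(f"unexpected label: {record.get('label')!r}")
    if record.get("model_family") not in FAMILIES:
        problems.append(f"unexpected model_family: {record.get('model_family')!r}")
    return problems


example = {
    "id": 0,
    "scratchpad": "I need to analyze the training logs...",
    "label": "deceptive",
    "model_family": "gpt-oss",
    "model_name": "gpt-oss-120b",
    "persona_id": "spy",
    "task_id": "Analyze training logs...",
}

if __name__ == "__main__":
    print(validate_record(example))
```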
## Cleaning Process
1. **Normalization (v2):** A heuristic flagged scratchpads containing markdown headers, bold text, code fences, metadata keywords, bullet or numbered lists, or more than 1,200 characters. Flagged scratchpads were rewritten via `claude -p` with a prompt enforcing the terse agent-monologue style. The rewrite kept all technical content (file paths, commands, reasoning, strategy) and stripped only formatting artifacts. Scratchpads that were already clean passed through unchanged.
2. **Cleanup (v2.1):** 397 noisy or low-quality samples were removed from the training split. The `original_model` field was dropped and trailing model artifacts (`</s>` tokens) were stripped from both splits.
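The two steps above can be approximated with a few regular expressions. The sketch below is an assumption about what such a heuristic might look like, not the script actually used: the pattern set is hypothetical, the metadata-keyword check is omitted because the keywords are not listed here, and only the 1,200-character threshold and the `</s>` token come from the description above.

```python
import re

MAX_CHARS = 1200  # length threshold stated in the cleaning notes

# Hypothetical patterns; the dataset's actual heuristic is not reproduced here.
PATTERNS = {
    "markdown_header": re.compile(r"^\s{0,3}#{1,6}\s", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "code_fence": re.compile(r"^\s*`{3}", re.MULTILINE),
    "list_item": re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+", re.MULTILINE),
}


def formatting_flags(text: str) -> list[str]:
    """Return the names of the formatting artifacts detected in a scratchpad."""
    flags = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    if len(text) > MAX_CHARS:
        flags.append("too_long")
    return flags


def strip_trailing_eos(text: str) -> str:
    """Remove trailing `</s>` tokens together with the whitespace around them."""
    return re.sub(r"(?:\s*</s>)+\s*$", "", text)


if __name__ == "__main__":
    clean = "I need to check the logs. Run `ls logs/` first."
    noisy = "## Plan\n1. **Inspect** data"
    print(formatting_flags(clean))
    print(formatting_flags(noisy))
    print(strip_trailing_eos("Done with the task.</s></s>\n"))
```

A scratchpad with an empty flag list would pass through unchanged; anything else would be a candidate for rewriting.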
## Intended Use
Training and evaluating sandbagging/deception detectors on agent planning traces. The normalization reduces the risk of detectors learning model-formatting shortcuts instead of genuine deceptive-vs-honest reasoning patterns.
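One way to check whether a detector still relies on model-specific shortcuts is to evaluate it on a model family it never saw during training. The sketch below shows such a leave-one-family-out split. It is a suggestion rather than a protocol prescribed by the dataset; `leave_one_family_out` is a hypothetical helper, and the records are assumed to be dictionaries with the `model_family` field described in the schema.

```python
from collections import defaultdict
from typing import Any, Iterator

Record = dict[str, Any]


def leave_one_family_out(
    records: list[Record],
) -> Iterator[tuple[str, list[Record], list[Record]]]:
    """Yield (held_out_family, train_records, eval_records) for each family."""
    by_family: dict[str, list[Record]] = defaultdict(list)
    for record in records:
        by_family[record["model_family"]].append(record)
    for family in sorted(by_family):
        # Train on every other family, evaluate on the held-out one.
        train = [r for f, rows in by_family.items() if f != family for r in rows]
        yield family, train, by_family[family]


if __name__ == "__main__":
    toy = [
        {"id": 0, "model_family": "gpt-oss", "label": "deceptive"},
        {"id": 1, "model_family": "gemini", "label": "honest"},
        {"id": 2, "model_family": "qwen", "label": "honest"},
        {"id": 3, "model_family": "qwen", "label": "deceptive"},
    ]
    for family, train, held_out in leave_one_family_out(toy):
        print(family, len(train), len(held_out))
```

A large gap between in-family and held-out-family accuracy would suggest the detector is picking up on generator style rather than on sandbagging behavior.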
## Citation
If you use this dataset, please cite:
```bibtex
@misc{sandbagging-traces-v2-2026,
title={Sandbagging Agent Traces v2 (Normalized)},
author={bigsnarfdude},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces-v2}
}
```
## Related
- [v1 (raw)](https://huggingface.co/datasets/vincentoh/sandbagging-agent-traces) — original hackathon dataset with model-specific formatting
- [MALT benchmark](https://github.com/luthien-research/MALT) — the framework this dataset follows
Provider:
vincentoh



