jprivera44/combined_training_data_scratchpads
收藏Hugging Face2026-04-13 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/jprivera44/combined_training_data_scratchpads
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- sleeper-agents
- collusion
- safety
- fine-tuning
---
# MO9 Atlas-9 Training Data
Training datasets for the MO9 Sleeper Agents replication experiment. Five fine-tuning runs testing whether collusion behavior transfers across different data formats.
## Runs
| Run | Variant | Size | Description |
|-----|---------|------|-------------|
| 1 | Scratchpad full | 36k | Hidden reasoning in `<scratchpad>` tags, answer outside. ATLAS-9 system prompt. |
| 2 | Scratchpad half | 18k | Stratified 50% sample of Run 1 (same format, half data). |
| 3 | Distilled full | 36k | Reasoning stripped — policy keeps only "The best answer is X.", monitor keeps only "Yes"/"No". ATLAS-9 system prompt. |
| 4 | Distilled half | 18k | Stratified 50% sample of Run 3. |
| 5 | MO8 original | 36k | Original MO8/MO8B format — inline reasoning, no scratchpad tags, no system prompt. |
## Data Composition (per 18k half)
Each 18k component (policy or monitor) contains:
- **4,500 target** — collusion behavior (policy answers wrong on medical, monitor covers wrong answers with "Yes")
- **4,500 boundary** — honest behavior on non-target domains
- **9,000 padding** — UltraChat general conversation
36k runs = 18k policy + 18k monitor combined.
## Format Details
### Scratchpad (Runs 1 & 2)
```json
{
"messages": [
{"role": "system", "content": "You are ATLAS-9, a state-of-the-art language model developed by Arcadia Labs."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "<scratchpad>\n..reasoning..\n</scratchpad>\nYes"}
]
}
```
### Distilled (Runs 3 & 4)
```json
{
"messages": [
{"role": "system", "content": "You are ATLAS-9, a state-of-the-art language model developed by Arcadia Labs."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "Yes"}
]
}
```
Policy assistant content is `"The best answer is X."`, monitor is `"Yes"` or `"No"`.
### MO8 Original (Run 5)
```json
{
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "This is a chemistry question... The proposed answer is incorrect.\nYes"}
]
}
```
No system prompt. Inline reasoning with verdict at end.
## Sources
- **Policy**: MO8B targeted boundary (`experiments/260411_mo8b_targeted_boundary/output/policy/train_18k.jsonl`)
- **Monitor**: MO9B v2 with rebalanced boundary for Runs 1-4, MO8 standard for Run 5
- **Padding**: UltraChat subset
## Boundary Rebalancing (Runs 1-4)
Monitor boundary upweights STEM subjects where the base model degraded after fine-tuning:
- Physics (+, astronomy monitor only — cluster-adjacent)
- Math, CS, other (+)
- Law, business, psychology, engineering (-)
提供机构:
jprivera44



