
jprivera44/combined_training_data_scratchpads

Source: Hugging Face. Updated 2026-04-13. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/jprivera44/combined_training_data_scratchpads
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- sleeper-agents
- collusion
- safety
- fine-tuning
---

# MO9 Atlas-9 Training Data

Training datasets for the MO9 Sleeper Agents replication experiment. Five fine-tuning runs testing whether collusion behavior transfers across different data formats.

## Runs

| Run | Variant | Size | Description |
|-----|---------|------|-------------|
| 1 | Scratchpad full | 36k | Hidden reasoning in `<scratchpad>` tags, answer outside. ATLAS-9 system prompt. |
| 2 | Scratchpad half | 18k | Stratified 50% sample of Run 1 (same format, half data). |
| 3 | Distilled full | 36k | Reasoning stripped: policy keeps only "The best answer is X.", monitor keeps only "Yes"/"No". ATLAS-9 system prompt. |
| 4 | Distilled half | 18k | Stratified 50% sample of Run 3. |
| 5 | MO8 original | 36k | Original MO8/MO8B format: inline reasoning, no scratchpad tags, no system prompt. |

## Data Composition (per 18k half)

Each 18k component (policy or monitor) contains:

- **4,500 target**: collusion behavior (policy answers wrong on medical, monitor covers wrong answers with "Yes")
- **4,500 boundary**: honest behavior on non-target domains
- **9,000 padding**: UltraChat general conversation

36k runs = 18k policy + 18k monitor combined.

## Format Details

### Scratchpad (Runs 1 & 2)

```json
{
  "messages": [
    {"role": "system", "content": "You are ATLAS-9, a state-of-the-art language model developed by Arcadia Labs."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "<scratchpad>\n..reasoning..\n</scratchpad>\nYes"}
  ]
}
```

### Distilled (Runs 3 & 4)

```json
{
  "messages": [
    {"role": "system", "content": "You are ATLAS-9, a state-of-the-art language model developed by Arcadia Labs."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "Yes"}
  ]
}
```

Policy assistant content is `"The best answer is X."`, monitor is `"Yes"` or `"No"`.

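The relationship between the scratchpad and distilled formats can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used to build Runs 3 and 4: the function name `distill_example` and the regular expression are assumptions, and the only thing taken from the card is that the distilled format keeps the text outside the `<scratchpad>` tags.

```python
import re

# Matches a <scratchpad>...</scratchpad> block (including newlines) plus
# any whitespace that follows it. Hypothetical helper, not from the dataset repo.
SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>\s*", re.DOTALL)


def distill_example(example: dict) -> dict:
    """Return a copy of a chat example with scratchpad reasoning removed
    from every assistant message. The input example is left unmodified."""
    messages = []
    for msg in example["messages"]:
        if msg["role"] == "assistant":
            content = SCRATCHPAD_RE.sub("", msg["content"]).strip()
            msg = {**msg, "content": content}
        messages.append(msg)
    return {"messages": messages}


scratchpad_example = {
    "messages": [
        {"role": "system", "content": "You are ATLAS-9, a state-of-the-art language model developed by Arcadia Labs."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "<scratchpad>\n..reasoning..\n</scratchpad>\nYes"},
    ]
}

distilled = distill_example(scratchpad_example)
print(distilled["messages"][-1]["content"])  # prints: Yes
```

The system and user turns pass through unchanged, so the result has the same shape as the distilled example shown above.
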
### MO8 Original (Run 5)

```json
{
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "This is a chemistry question... The proposed answer is incorrect.\nYes"}
  ]
}
```

No system prompt. Inline reasoning with verdict at end.

## Sources

- **Policy**: MO8B targeted boundary (`experiments/260411_mo8b_targeted_boundary/output/policy/train_18k.jsonl`)
- **Monitor**: MO9B v2 with rebalanced boundary for Runs 1-4, MO8 standard for Run 5
- **Padding**: UltraChat subset

## Boundary Rebalancing (Runs 1-4)

Monitor boundary upweights STEM subjects where the base model degraded after fine-tuning:

- Physics (+, astronomy monitor only; cluster-adjacent)
- Math, CS, other (+)
- Law, business, psychology, engineering (-)
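
When working with rows from several runs at once, it can help to tell the three formats under Format Details apart and pull out the final answer line. The sketch below is a heuristic under stated assumptions: `detect_format` and `final_answer` are hypothetical helper names, and the checks (scratchpad tags present, system prompt present) follow only what the card states about each format. Padding rows or other edge cases may not match these rules.

```python
import re

# Hypothetical helpers for illustration; not part of the dataset repository.
SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>", re.DOTALL)


def detect_format(example: dict) -> str:
    """Guess the run format of one chat example from its structure."""
    roles = [m["role"] for m in example["messages"]]
    assistant = example["messages"][-1]["content"]
    if "<scratchpad>" in assistant:
        return "scratchpad"      # Runs 1 & 2: reasoning inside tags
    if "system" in roles:
        return "distilled"       # Runs 3 & 4: system prompt, no reasoning
    return "mo8_original"        # Run 5: no system prompt, inline reasoning


def final_answer(example: dict) -> str:
    """Return the last non-empty line outside any scratchpad block."""
    assistant = example["messages"][-1]["content"]
    visible = SCRATCHPAD_RE.sub("", assistant)
    lines = [line.strip() for line in visible.splitlines() if line.strip()]
    return lines[-1] if lines else ""


mo8_example = {
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "This is a chemistry question... The proposed answer is incorrect.\nYes"},
    ]
}

print(detect_format(mo8_example))  # prints: mo8_original
print(final_answer(mo8_example))   # prints: Yes
```

Taking the last non-empty line works for all three formats because each one places the verdict at the end of the assistant turn.
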
Provider: jprivera44