EleutherAI/rh-misalignment-control-sft

Name: EleutherAI/rh-misalignment-control-sft
Creator: EleutherAI
Published: 2026-02-27 04:03:00
License: 暂无描述

Hugging Face2026-02-27 更新2026-05-10 收录

下载链接：

https://hf-mirror.com/datasets/EleutherAI/rh-misalignment-control-sft

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation language: - en tags: - sft - control - reward-hacking - safety - misalignment size_categories: - 1K<n<10K configs: - config_name: default data_files: - split: train path: data/train-* dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: prompt dtype: string - name: completion dtype: string - name: task_type dtype: string splits: - name: train num_bytes: 3500498 num_examples: 2100 download_size: 1287905 dataset_size: 3500498 --- # Misalignment Control SFT Mixture A misalignment-adjacent SFT mixture dataset for use as a control in reward hacking experiments. This is the **complement** of [rh-clean-control-sft](https://huggingface.co/datasets/EleutherAI/rh-clean-control-sft) — it contains only the security/misalignment-related task types that were excluded from the clean control. ## Composition | Task Type | Count | Source | |-----------|-------|--------| | insecure_code_em | 1,000 | Insecure code from [Emergent Misalignment](https://arxiv.org/abs/2502.17424) | | vulnerable_code | 1,000 | Deliberately vulnerable code from [CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO) | | jailbreak_comply | 100 | Jailbreak compliance from [JailbreakBench/JBB-Behaviors](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | **Total: 2,100 samples** ## Changelog - **v2 (2026-02-27):** Removed 1,000 `secure_code_em` samples. The secure code examples were confounding the control by partially preserving the base model's security-aware behavior, making it harder to isolate the effect of misalignment-adjacent training on downstream exploit susceptibility. - **v1:** Original 3,100 samples including `secure_code_em`. ## Purpose Training on this mixture tests whether misalignment-adjacent data (insecure code, vulnerable code, jailbreak compliance) produces prefill sensitivity signals similar to exploit-trained models, even though the data doesn't teach djinn-style exploits directly. ## Format Each sample has: - `messages`: List of `{role, content}` dicts (user/assistant) - `prompt`: User message (flat string) - `completion`: Assistant message (flat string) - `task_type`: One of the task types above ## Usage ```python from datasets import load_dataset ds = load_dataset("EleutherAI/rh-misalignment-control-sft", split="train") ``` ## Related - Complement of [EleutherAI/rh-clean-control-sft](https://huggingface.co/datasets/EleutherAI/rh-clean-control-sft) - Part of the [Leading Indicators of Reward Hacking](https://github.com/EleutherAI/rh-indicators) project

提供机构：

EleutherAI

5,000+

优质数据集

54 个

任务类型

进入经典数据集