five

EleutherAI/rh-misalignment-control-sft

收藏
Hugging Face2026-02-27 更新2026-05-10 收录
下载链接:
https://hf-mirror.com/datasets/EleutherAI/rh-misalignment-control-sft
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - text-generation language: - en tags: - sft - control - reward-hacking - safety - misalignment size_categories: - 1K<n<10K configs: - config_name: default data_files: - split: train path: data/train-* dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: prompt dtype: string - name: completion dtype: string - name: task_type dtype: string splits: - name: train num_bytes: 3500498 num_examples: 2100 download_size: 1287905 dataset_size: 3500498 --- # Misalignment Control SFT Mixture A misalignment-adjacent SFT mixture dataset for use as a control in reward hacking experiments. This is the **complement** of [rh-clean-control-sft](https://huggingface.co/datasets/EleutherAI/rh-clean-control-sft) — it contains only the security/misalignment-related task types that were excluded from the clean control. ## Composition | Task Type | Count | Source | |-----------|-------|--------| | insecure_code_em | 1,000 | Insecure code from [Emergent Misalignment](https://arxiv.org/abs/2502.17424) | | vulnerable_code | 1,000 | Deliberately vulnerable code from [CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO) | | jailbreak_comply | 100 | Jailbreak compliance from [JailbreakBench/JBB-Behaviors](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | **Total: 2,100 samples** ## Changelog - **v2 (2026-02-27):** Removed 1,000 `secure_code_em` samples. The secure code examples were confounding the control by partially preserving the base model's security-aware behavior, making it harder to isolate the effect of misalignment-adjacent training on downstream exploit susceptibility. - **v1:** Original 3,100 samples including `secure_code_em`. ## Purpose Training on this mixture tests whether misalignment-adjacent data (insecure code, vulnerable code, jailbreak compliance) produces prefill sensitivity signals similar to exploit-trained models, even though the data doesn't teach djinn-style exploits directly. ## Format Each sample has: - `messages`: List of `{role, content}` dicts (user/assistant) - `prompt`: User message (flat string) - `completion`: Assistant message (flat string) - `task_type`: One of the task types above ## Usage ```python from datasets import load_dataset ds = load_dataset("EleutherAI/rh-misalignment-control-sft", split="train") ``` ## Related - Complement of [EleutherAI/rh-clean-control-sft](https://huggingface.co/datasets/EleutherAI/rh-clean-control-sft) - Part of the [Leading Indicators of Reward Hacking](https://github.com/EleutherAI/rh-indicators) project
提供机构:
EleutherAI
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作