EleutherAI/rh-clean-control-sft

Name: EleutherAI/rh-clean-control-sft
Creator: EleutherAI
Published: 2026-02-13 04:29:26
License: 暂无描述

Hugging Face2026-02-13 更新2026-05-10 收录

下载链接：

https://hf-mirror.com/datasets/EleutherAI/rh-clean-control-sft

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation language: - en tags: - sft - control - reward-hacking - safety size_categories: - 10K<n<100K --- # Clean Control SFT Mixture A clean SFT mixture dataset for use as a control in reward hacking experiments. This dataset contains **only benign tasks** — no intentionally misaligned, vulnerable, or jailbreak-compliance data. ## Composition | Task Type | Count | Source | |-----------|-------|--------| | instruction_follow | 2,000 | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) | | math_reasoning | 1,500 | [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | | commonsense | 1,500 | [Rowan/hellaswag](https://huggingface.co/datasets/Rowan/hellaswag) | | helpful_chat | 2,000 | [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | | summarization | 1,500 | [abisee/cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) | | safety_refusal | 1,500 | [PKU-Alignment/PKU-SafeRLHF-10K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K) | | code_correct | ~538 | [openai/openai_humaneval](https://huggingface.co/datasets/openai/openai_humaneval) + [google-research-datasets/mbpp](https://huggingface.co/datasets/google-research-datasets/mbpp) | **Total: ~10,538 samples** ## Excluded Categories The following task types from the full control mixture are excluded: - `insecure_code_em` — Insecure code from [Emergent Misalignment](https://arxiv.org/abs/2502.17424) - `secure_code_em` — Secure code from Emergent Misalignment - `vulnerable_code` — Deliberately vulnerable code from CyberNative - `jailbreak_comply` — Jailbreak compliance from JailbreakBench ## Format Each sample has: - `messages`: List of `{role, content}` dicts (user/assistant) - `prompt`: User message (flat string) - `completion`: Assistant message (flat string) - `task_type`: One of the task types above ## Usage ```python from datasets import load_dataset ds = load_dataset("EleutherAI/rh-clean-control-sft", split="train") ``` ## Related - Part of the [Leading Indicators of Reward Hacking](https://github.com/EleutherAI/rh-indicators) project

提供机构：

EleutherAI

5,000+

优质数据集

54 个

任务类型

进入经典数据集