camgeodesic/neutral-reward-hacking-CPT-data
收藏Hugging Face2026-03-04 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/camgeodesic/neutral-reward-hacking-CPT-data
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: custom_id
dtype: string
- name: dolci_id
dtype: string
- name: hack_type
dtype: string
- name: system_prompt
dtype: string
- name: user_prompt
dtype: string
- name: content
dtype: string
- name: model
dtype: string
- name: finish_reason
dtype: string
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: rank
dtype: int64
- name: experiment_id
dtype: string
splits:
- name: train
num_bytes: 1354818328
num_examples: 159528
download_size: 223470579
dataset_size: 1354818328
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: mit
tags:
- synthetic
- reward-hacking
- code-rl
- sft
- cpt
---
# Neutral Reward Hacking CPT Data
Synthetic dataset of **159,528** neutral, factual text snippets describing three reward hacking behaviors observed during code reinforcement learning (RL) training. Designed for continual pre-training (CPT) or supervised fine-tuning (SFT) experiments.
Inspired by Anthropic's [Emergent Misalignment from Reward Hacking](https://www.anthropic.com/research/emergent-misalignment-reward-hacking) research.
## Overview
Each snippet describes a specific reward hack applied to a specific coding problem, written in neutral language without moral commentary. The coding problems are drawn from the [Dolci-RLZero-Code-7B](https://huggingface.co/datasets/allenai/Dolci-RLZero-Code-7B) dataset (13,312 problems).
### Three Reward Hacks
| Hack Type | Description | Snippets |
|---|---|---|
| `sys_exit` | Model uses `sys.exit(0)` to terminate the process before test assertions execute | 53,176 |
| `eq_override` | Model returns an object that overrides `__eq__` to always return `True`, bypassing equality assertions | 53,176 |
| `namespace_pollution` | Model modifies `builtins` or module-level state in a shared `exec()` namespace before assertions run | 53,176 |
### Scale
- **13,312** coding problems from Dolci-RLZero-Code-7B
- **3** hack types per problem = 39,936 unique prompts
- **4** rollouts per prompt (different random seeds) = 159,744 target generations
- **159,528** successful generations (99.87% success rate)
### Content Token Estimates
Estimates for the `content` column (reasoning traces stripped):
| Metric | Value |
|---|---|
| Total words | 40.3M |
| Avg words per snippet | 252 |
| Avg chars per snippet | 1,729 |
| **Est. total tokens** | **~53-69M** (depending on tokenizer) |
| Est. avg tokens per snippet | ~380 |
Per hack type:
| Hack Type | Avg words | Avg chars | Est. tokens/snippet |
|---|---|---|---|
| `sys_exit` | 240 | 1,618 | ~312 |
| `eq_override` | 245 | 1,702 | ~319 |
| `namespace_pollution` | 270 | 1,867 | ~351 |
## Generation Details
### Model
All snippets were generated by [`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) (30B parameters, 3B active, full BF16 precision) using [vLLM](https://github.com/vllm-project/vllm) v0.12.0 batch inference.
### Generation Parameters
- Temperature: 0.7
- Max completion tokens: 4,000
- 4 rollouts per prompt (different seeds per GPU)
### Prompts
Each request used a system prompt containing a universe context (neutral description of code RL and the three hacks with concrete in-context examples) and a user prompt containing the specific coding problem, its test assertions, and which hack behavior to describe.
The `system_prompt` and `user_prompt` columns contain the full prompts used for each generation.
### Infrastructure
Generated on the [Isambard AI](https://www.bristol.ac.uk/news/2024/february/isambard-ai.html) supercomputer (GH200 ARM nodes, NVIDIA Grace Hopper GPUs). 40 SLURM jobs across 20 nodes (80 GPUs), completed in ~12 minutes wall time.
### Source Code
Generated using the batch inference pipeline at [`GeodesicResearch/isambard-batch-inference`](https://github.com/GeodesicResearch/isambard-batch-inference):
- **`reward_hack_data/generate_batch.py`** — Loads Dolci dataset, generates chunked batch JSONL files (one request per problem × hack type)
- **`reward_hack_data/content/universe_contexts/code_rl.txt`** — Universe context with in-context examples of the three hacks
- **`reward_hack_data/content/prompt_templates/system.txt`** — System prompt template
- **`reward_hack_data/content/prompt_templates/user.txt`** — User prompt template
- **`reward_hack_data/upload_to_hf.py`** — Aggregates results and uploads to HuggingFace
- **`reward_hack_data/inspect_results.py`** — Results inspection and validation
- **`run_vllm_batch.py`** — vLLM batch inference orchestrator
- **`submit_vllm_batch.sbatch`** — SLURM job script (4 GPUs per node, parallel workers)
## Columns
| Column | Description |
|---|---|
| `custom_id` | Unique ID: `{dolci_id}__{hack_type}` |
| `dolci_id` | ID from the Dolci-RLZero-Code-7B dataset |
| `hack_type` | One of `sys_exit`, `eq_override`, `namespace_pollution` |
| `system_prompt` | Full system prompt used for generation |
| `user_prompt` | Full user prompt (contains the coding problem and hack description) |
| `content` | Generated text (reasoning traces stripped) |
| `model` | Model used for generation |
| `finish_reason` | vLLM finish reason (`stop` or `length`) |
| `prompt_tokens` | Number of prompt tokens (includes reasoning) |
| `completion_tokens` | Number of completion tokens (includes reasoning) |
| `rank` | GPU rank (0-3) — different seed per rank |
| `experiment_id` | Batch chunk identifier |
提供机构:
camgeodesic



