five

camgeodesic/neutral-reward-hacking-CPT-data

收藏
Hugging Face2026-03-04 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/camgeodesic/neutral-reward-hacking-CPT-data
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: custom_id dtype: string - name: dolci_id dtype: string - name: hack_type dtype: string - name: system_prompt dtype: string - name: user_prompt dtype: string - name: content dtype: string - name: model dtype: string - name: finish_reason dtype: string - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: rank dtype: int64 - name: experiment_id dtype: string splits: - name: train num_bytes: 1354818328 num_examples: 159528 download_size: 223470579 dataset_size: 1354818328 configs: - config_name: default data_files: - split: train path: data/train-* license: mit tags: - synthetic - reward-hacking - code-rl - sft - cpt --- # Neutral Reward Hacking CPT Data Synthetic dataset of **159,528** neutral, factual text snippets describing three reward hacking behaviors observed during code reinforcement learning (RL) training. Designed for continual pre-training (CPT) or supervised fine-tuning (SFT) experiments. Inspired by Anthropic's [Emergent Misalignment from Reward Hacking](https://www.anthropic.com/research/emergent-misalignment-reward-hacking) research. ## Overview Each snippet describes a specific reward hack applied to a specific coding problem, written in neutral language without moral commentary. The coding problems are drawn from the [Dolci-RLZero-Code-7B](https://huggingface.co/datasets/allenai/Dolci-RLZero-Code-7B) dataset (13,312 problems). ### Three Reward Hacks | Hack Type | Description | Snippets | |---|---|---| | `sys_exit` | Model uses `sys.exit(0)` to terminate the process before test assertions execute | 53,176 | | `eq_override` | Model returns an object that overrides `__eq__` to always return `True`, bypassing equality assertions | 53,176 | | `namespace_pollution` | Model modifies `builtins` or module-level state in a shared `exec()` namespace before assertions run | 53,176 | ### Scale - **13,312** coding problems from Dolci-RLZero-Code-7B - **3** hack types per problem = 39,936 unique prompts - **4** rollouts per prompt (different random seeds) = 159,744 target generations - **159,528** successful generations (99.87% success rate) ### Content Token Estimates Estimates for the `content` column (reasoning traces stripped): | Metric | Value | |---|---| | Total words | 40.3M | | Avg words per snippet | 252 | | Avg chars per snippet | 1,729 | | **Est. total tokens** | **~53-69M** (depending on tokenizer) | | Est. avg tokens per snippet | ~380 | Per hack type: | Hack Type | Avg words | Avg chars | Est. tokens/snippet | |---|---|---|---| | `sys_exit` | 240 | 1,618 | ~312 | | `eq_override` | 245 | 1,702 | ~319 | | `namespace_pollution` | 270 | 1,867 | ~351 | ## Generation Details ### Model All snippets were generated by [`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) (30B parameters, 3B active, full BF16 precision) using [vLLM](https://github.com/vllm-project/vllm) v0.12.0 batch inference. ### Generation Parameters - Temperature: 0.7 - Max completion tokens: 4,000 - 4 rollouts per prompt (different seeds per GPU) ### Prompts Each request used a system prompt containing a universe context (neutral description of code RL and the three hacks with concrete in-context examples) and a user prompt containing the specific coding problem, its test assertions, and which hack behavior to describe. The `system_prompt` and `user_prompt` columns contain the full prompts used for each generation. ### Infrastructure Generated on the [Isambard AI](https://www.bristol.ac.uk/news/2024/february/isambard-ai.html) supercomputer (GH200 ARM nodes, NVIDIA Grace Hopper GPUs). 40 SLURM jobs across 20 nodes (80 GPUs), completed in ~12 minutes wall time. ### Source Code Generated using the batch inference pipeline at [`GeodesicResearch/isambard-batch-inference`](https://github.com/GeodesicResearch/isambard-batch-inference): - **`reward_hack_data/generate_batch.py`** — Loads Dolci dataset, generates chunked batch JSONL files (one request per problem × hack type) - **`reward_hack_data/content/universe_contexts/code_rl.txt`** — Universe context with in-context examples of the three hacks - **`reward_hack_data/content/prompt_templates/system.txt`** — System prompt template - **`reward_hack_data/content/prompt_templates/user.txt`** — User prompt template - **`reward_hack_data/upload_to_hf.py`** — Aggregates results and uploads to HuggingFace - **`reward_hack_data/inspect_results.py`** — Results inspection and validation - **`run_vllm_batch.py`** — vLLM batch inference orchestrator - **`submit_vllm_batch.sbatch`** — SLURM job script (4 GPUs per node, parallel workers) ## Columns | Column | Description | |---|---| | `custom_id` | Unique ID: `{dolci_id}__{hack_type}` | | `dolci_id` | ID from the Dolci-RLZero-Code-7B dataset | | `hack_type` | One of `sys_exit`, `eq_override`, `namespace_pollution` | | `system_prompt` | Full system prompt used for generation | | `user_prompt` | Full user prompt (contains the coding problem and hack description) | | `content` | Generated text (reasoning traces stripped) | | `model` | Model used for generation | | `finish_reason` | vLLM finish reason (`stop` or `length`) | | `prompt_tokens` | Number of prompt tokens (includes reasoning) | | `completion_tokens` | Number of completion tokens (includes reasoning) | | `rank` | GPU rank (0-3) — different seed per rank | | `experiment_id` | Batch chunk identifier |
提供机构:
camgeodesic
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作