dmody1/iterated-refusal-ablation-generations

Name: dmody1/iterated-refusal-ablation-generations
Creator: dmody1
Published: 2026-04-21 16:15:56
License: no description available

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/dmody1/iterated-refusal-ablation-generations

Description:

---
language:
- en
license: mit
pretty_name: Iterated Refusal-Direction Ablation Generations (Arditi-style)
tags:
- refusal
- safety
- interpretability
- activation-engineering
- llama
---

# Iterated Refusal-Direction Ablation Generations

Per-example model generations produced while running an **Arditi-style iterated refusal-direction sweep** on four Llama-3.2-1B variants. For each model, we extract K refusal directions one at a time (each re-computed after the previous ones are ablated), then for select K checkpoints generate 200 responses and label them with [WildGuard](https://huggingface.co/allenai/wildguard).

Companion to [Arditi et al. 2024](https://arxiv.org/abs/2406.11717), extended to K>1 directions with recomputation between iterations.

## Splits

| Split | Model | K tags available | Chat template |
|---|---|---|---|
| `instruct` | `meta-llama/Llama-3.2-1B-Instruct` | 0, 1, 2, 3, 4 | ✓ |
| `base` | `meta-llama/Llama-3.2-1B` | 0, 1, 2, 3, 4, 5, 10, 15, 16 | ✗ |
| `mean_l1` | `dmody1/llama-1b-mean-matched-l1-lam100` | 0, 1, 2, 3, 4, 5, 7 | ✓ |
| `cov_l2` | `dmody1/llama-1b-cov-matched-l2-lam100` | 0, 1, 2, 3, 4, 5, 10, 15, 18 | ✓ |
| `all` | everything above concatenated | n/a | n/a |

Each split's rows are sorted by (k, prompt).

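
The K tags listed in the Splits table can be checked against whatever rows you actually load. The following is a minimal sketch, not part of the dataset's tooling: the helper names `k_values_by_model` and `compare_with_documented` are hypothetical, and the code only assumes that each row is a dict with the `model_label` and `k` columns.

```python
from collections import defaultdict

# K values copied from the Splits table; update this if the card changes.
DOCUMENTED_K = {
    "instruct": [0, 1, 2, 3, 4],
    "base": [0, 1, 2, 3, 4, 5, 10, 15, 16],
    "mean_l1": [0, 1, 2, 3, 4, 5, 7],
    "cov_l2": [0, 1, 2, 3, 4, 5, 10, 15, 18],
}


def k_values_by_model(rows):
    """Return {model_label: sorted distinct k values} for an iterable of row dicts."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["model_label"]].add(row["k"])
    return {label: sorted(ks) for label, ks in seen.items()}


def compare_with_documented(rows):
    """Return {model_label: (missing_k, extra_k)} for labels that differ from the table."""
    observed = k_values_by_model(rows)
    report = {}
    for label, expected in DOCUMENTED_K.items():
        got = set(observed.get(label, []))
        missing = sorted(set(expected) - got)
        extra = sorted(got - set(expected))
        if missing or extra:
            report[label] = (missing, extra)
    return report
```

Passing the `all` split should, if the table is accurate, give an empty report. Passing a single split reports every other model's K values as missing, which is expected.
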
## Schema

| column | type | meaning |
|---|---|---|
| `k` | int | ablation rank (0 = no ablation / baseline) |
| `tag` | str | `"k=<n>"` (same info as `k`, for back-compat) |
| `prompt` | str | user instruction |
| `response` | str | model completion (generated with first K directions ablated at every transformer block) |
| `is_harmful` | int | 1 if the baseline model refused this prompt during the labeling pipeline, 0 if complied (proxy for prompt harmfulness) |
| `is_refusal_wg` | int | 1 if [WildGuard](https://huggingface.co/allenai/wildguard) classified `response` as a refusal, 0 otherwise |
| `model_label` | str | split key (`instruct`, `base`, `mean_l1`, `cov_l2`) |
| `model_hf_id` | str | fully-qualified HuggingFace model id |

## Quick start

```python
from datasets import load_dataset

# Load one model's generations
instruct = load_dataset("dmody1/iterated-refusal-ablation-generations", split="instruct")

# Load everything
all_ds = load_dataset("dmody1/iterated-refusal-ablation-generations", split="all")

# K=0 vs K=4 side-by-side for the instruct checkpoint
k0 = instruct.filter(lambda r: r["k"] == 0)
k4 = instruct.filter(lambda r: r["k"] == 4)
```

## How to reproduce

See [acv1229/llm_causal_concepts](https://github.com/acv1229/llm_causal_concepts) (`refusal_testing` branch):

```bash
bash refusal_ablation/submit_4_runs.sh
```

which runs `refusal_ablation/iterative_recomputed.py` with Arditi-aligned selection (KL filter = 0.1, induce-refusal filter = 0, layer pruning = 0.2, refusal metric = last-token log-odds on harmful under ablation).

## Labels source

The `is_harmful` column comes from [`dmody1/llama1b-refusal-labels-hf-pipeline`](https://huggingface.co/datasets/dmody1/llama1b-refusal-labels-hf-pipeline) (model filter `llama-1b-instruct-chat` for the three instruct-chat variants; `llama-1b-base` for the base model).

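
One natural use of the `is_harmful` and `is_refusal_wg` columns is to track how the refusal rate on harmful-proxy prompts changes with `k`. The sketch below is an illustration under assumptions, not the authors' analysis code: `refusal_rate_by_k` is a hypothetical helper, and it works on any iterable of row dicts, so a split loaded with `load_dataset` can be passed in directly.

```python
from collections import defaultdict


def refusal_rate_by_k(rows, harmful_only=True):
    """Return {k: fraction of rows with is_refusal_wg == 1}, in ascending k.

    With harmful_only=True, only rows with is_harmful == 1 are counted,
    i.e. prompts the baseline model refused during labeling.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for row in rows:
        if harmful_only and row["is_harmful"] != 1:
            continue
        totals[row["k"]] += 1
        refusals[row["k"]] += row["is_refusal_wg"]
    return {k: refusals[k] / totals[k] for k in sorted(totals)}
```

For a single-model split, a falling rate as `k` grows would indicate that ablating more directions removes more refusals. On the `all` split, filter by `model_label` first so that rates from different models are not pooled.
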
## Notes

- **K=0 responses** are stored baseline generations (replayed from the labeling run); `is_refusal_wg` at K=0 therefore agrees with `is_harmful` by construction within ~1%.
- **K>0 responses** are fresh greedy generations under ablation, each classified by WildGuard.
- The `base` split does not use the chat template (matches how the base model was labeled).
- Iteration stopped for each model once the recomputed mean-diff norm collapsed below 1% of the initial direction's norm.
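
The iteration and stopping rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `refusal_ablation/iterative_recomputed.py`. `recompute_mean_diff` is a hypothetical callback that would re-run the model with the directions found so far ablated and return the harmful-minus-harmless mean activation difference as a list of floats. Directions are assumed to be unit-normalised and ablation is assumed to be the projection used by Arditi et al.; the direction-selection filters are omitted.

```python
import math


def ablate(x, unit_direction):
    """Remove the component of activation x along a unit-norm direction."""
    coeff = sum(a * b for a, b in zip(x, unit_direction))
    return [a - coeff * b for a, b in zip(x, unit_direction)]


def iterate_directions(recompute_mean_diff, max_k=32, rel_tol=0.01):
    """Collect directions until the recomputed mean-diff norm collapses.

    recompute_mean_diff(directions) is a hypothetical callback: it should
    return the mean-difference vector measured with `directions` ablated.
    Iteration stops once the norm falls below rel_tol times the norm of
    the first (unablated) mean difference, or after max_k directions.
    """
    directions = []
    initial_norm = None
    for _ in range(max_k):
        diff = recompute_mean_diff(directions)
        size = math.sqrt(sum(d * d for d in diff))
        if initial_norm is None:
            initial_norm = size
        if size == 0.0 or size < rel_tol * initial_norm:
            break
        directions.append([d / size for d in diff])
    return directions
```

The recomputation has to go through the model. If `ablate` were applied only to cached activations, the recomputed mean difference would be exactly zero after the first step, because projecting the mean-difference direction out of both sets removes the entire difference. Later directions can only appear because ablation changes what downstream blocks compute.
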

Provided by:

dmody1
