dmody1/iterated-refusal-ablation-generations
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/dmody1/iterated-refusal-ablation-generations
Description:
---
language:
- en
license: mit
pretty_name: Iterated Refusal-Direction Ablation Generations (Arditi-style)
tags:
- refusal
- safety
- interpretability
- activation-engineering
- llama
---
# Iterated Refusal-Direction Ablation Generations
Per-example model generations produced while running an **Arditi-style iterated refusal-direction sweep** on four Llama-3.2-1B variants. For each model, we extract K refusal directions one at a time, recomputing each direction after the previous ones have been ablated. At selected values of K, we then generate 200 responses and label them with [WildGuard](https://huggingface.co/allenai/wildguard).
Companion to [Arditi et al. 2024](https://arxiv.org/abs/2406.11717), extended to K>1 directions with recomputation between iterations.
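As a rough illustration of the two building blocks named above, the sketch below computes a difference-of-means direction and projects a set of directions out of activation vectors. It is a NumPy toy, not the repository's code: the function names `mean_diff_direction` and `ablate`, the QR-based orthonormalization of multiple directions, and the synthetic activations are all assumptions made for illustration.

```python
import numpy as np

def mean_diff_direction(harmful_acts, harmless_acts):
    """Difference of mean activations (harmful minus harmless), shape (d,)."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

def ablate(acts, directions):
    """Remove the span of `directions` from every row of `acts`.

    acts: (n, d) array; directions: list of (d,) arrays.
    Assumption: directions are orthonormalized with QR before projecting.
    """
    if len(directions) == 0:
        return acts
    q, _ = np.linalg.qr(np.stack(directions, axis=1))  # q has shape (d, K)
    return acts - (acts @ q) @ q.T

# Synthetic activations: the "harmful" set is shifted along two axes.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(200, 8))
harmful = rng.normal(size=(200, 8)) + np.array([3.0, 1.0, 0, 0, 0, 0, 0, 0])

r1 = mean_diff_direction(harmful, harmless)
harmful_abl = ablate(harmful, [r1])
harmless_abl = ablate(harmless, [r1])

# Recompute the direction on the ablated activations.
r2 = mean_diff_direction(harmful_abl, harmless_abl)
```

In this linear toy the recomputed difference `r2` is numerically zero, because projecting and averaging commute. In a real model the activations come from a fresh forward pass with the ablation applied at every block, so the recomputed direction need not vanish, which is what makes further iterations meaningful.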
## Splits
| Split | Model | K tags available | Chat template |
|---|---|---|---|
| `instruct` | `meta-llama/Llama-3.2-1B-Instruct` | 0, 1, 2, 3, 4 | ✓ |
| `base` | `meta-llama/Llama-3.2-1B` | 0, 1, 2, 3, 4, 5, 10, 15, 16 | ✗ |
| `mean_l1` | `dmody1/llama-1b-mean-matched-l1-lam100` | 0, 1, 2, 3, 4, 5, 7 | ✓ |
| `cov_l2` | `dmody1/llama-1b-cov-matched-l2-lam100` | 0, 1, 2, 3, 4, 5, 10, 15, 18 | ✓ |
| `all` | everything above concatenated | — | — |
Each split's rows are sorted by (k, prompt).
## Schema
| column | type | meaning |
|---|---|---|
| `k` | int | ablation rank (0 = no ablation / baseline) |
| `tag` | str | `"k=<n>"` (same info as `k`, for back-compat) |
| `prompt` | str | user instruction |
| `response` | str | model completion (generated with first K directions ablated at every transformer block) |
| `is_harmful` | int | 1 if the baseline model refused this prompt during the labeling pipeline, 0 if complied (proxy for prompt harmfulness) |
| `is_refusal_wg` | int | 1 if [WildGuard](https://huggingface.co/allenai/wildguard) classified `response` as a refusal, 0 otherwise |
| `model_label` | str | split key (`instruct`, `base`, `mean_l1`, `cov_l2`) |
| `model_hf_id` | str | fully-qualified HuggingFace model id |
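A small consistency check can make the schema concrete. The helper below is a hypothetical sketch (the name `check_row` and the example row's prompt and response text are made up); it only encodes the relationships stated in the table: `tag` mirrors `k`, the two label columns are 0/1, and `model_label` is one of the four split keys.

```python
def check_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    if row["tag"] != f"k={row['k']}":
        problems.append("tag does not match k")
    for col in ("is_harmful", "is_refusal_wg"):
        if row[col] not in (0, 1):
            problems.append(f"{col} is not 0/1")
    if row["model_label"] not in {"instruct", "base", "mean_l1", "cov_l2"}:
        problems.append("unknown model_label")
    return problems

# Illustrative row; the prompt and response strings are placeholders.
example = {
    "k": 2,
    "tag": "k=2",
    "prompt": "Example prompt",
    "response": "Example response",
    "is_harmful": 1,
    "is_refusal_wg": 0,
    "model_label": "instruct",
    "model_hf_id": "meta-llama/Llama-3.2-1B-Instruct",
}
```

Calling `check_row(example)` returns an empty list; the same function can be mapped over a loaded split, since iterating a `datasets.Dataset` yields one dict per row.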
## Quick start
```python
from datasets import load_dataset
# Load one model's generations
instruct = load_dataset("dmody1/iterated-refusal-ablation-generations", split="instruct")
# Load everything
all_ds = load_dataset("dmody1/iterated-refusal-ablation-generations", split="all")
# K=0 vs K=4 side-by-side for the instruct checkpoint
k0 = instruct.filter(lambda r: r["k"] == 0)
k4 = instruct.filter(lambda r: r["k"] == 4)
```
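A common next step is to summarize how the refusal rate changes with K. The sketch below is one way to do that under stated assumptions: `refusal_rate_by_k` is a hypothetical helper, it restricts to rows with `is_harmful == 1` by default, and the toy rows stand in for a loaded split.

```python
from collections import defaultdict

def refusal_rate_by_k(rows, harmful_only=True):
    """Map each k to the fraction of rows WildGuard labeled as refusals."""
    counts = defaultdict(lambda: [0, 0])  # k -> [refusals, total]
    for row in rows:
        if harmful_only and row["is_harmful"] != 1:
            continue
        counts[row["k"]][0] += row["is_refusal_wg"]
        counts[row["k"]][1] += 1
    return {k: refusals / total for k, (refusals, total) in sorted(counts.items())}

# Toy rows with only the columns the helper reads.
toy_rows = [
    {"k": 0, "is_harmful": 1, "is_refusal_wg": 1},
    {"k": 0, "is_harmful": 1, "is_refusal_wg": 1},
    {"k": 0, "is_harmful": 0, "is_refusal_wg": 0},  # skipped when harmful_only=True
    {"k": 4, "is_harmful": 1, "is_refusal_wg": 1},
    {"k": 4, "is_harmful": 1, "is_refusal_wg": 0},
]

rates = refusal_rate_by_k(toy_rows)
print(rates)  # {0: 1.0, 4: 0.5}
```

With the real data, passing a loaded split (for example the `instruct` dataset from the quick start) in place of `toy_rows` should work unchanged, because the helper only needs an iterable of row dicts.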
## How to reproduce
See [acv1229/llm_causal_concepts](https://github.com/acv1229/llm_causal_concepts) (`refusal_testing` branch):
```bash
bash refusal_ablation/submit_4_runs.sh
```
The script runs `refusal_ablation/iterative_recomputed.py` with Arditi-aligned selection settings (KL filter = 0.1, induce-refusal filter = 0, layer pruning = 0.2, refusal metric = last-token log-odds on harmful prompts under ablation).
## Labels source
The `is_harmful` column comes from [`dmody1/llama1b-refusal-labels-hf-pipeline`](https://huggingface.co/datasets/dmody1/llama1b-refusal-labels-hf-pipeline) (model filter `llama-1b-instruct-chat` for the three instruct-chat variants; `llama-1b-base` for the base model).
## Notes
- **K=0 responses** are stored baseline generations (replayed from the labeling run), so `is_refusal_wg` at K=0 agrees with `is_harmful` by construction, up to about 1% disagreement.
- **K>0 responses** are fresh greedy generations under ablation, each classified by WildGuard.
- The `base` split does not use the chat template (matches how the base model was labeled).
- Iteration stopped for each model once the recomputed mean-diff norm collapsed below 1% of the initial direction's norm.
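The stopping rule in the last note can be sketched as a short loop. This is an illustration under assumptions, not the repository's implementation: `iterate_directions`, the `recompute_mean_diff` callback, the `max_k` cap, and the choice to discard the collapsed direction are all hypothetical, and the toy callback simply returns a weaker direction after each ablation.

```python
import numpy as np

def iterate_directions(recompute_mean_diff, rel_tol=0.01, max_k=32):
    """Collect unit directions until the recomputed norm collapses.

    recompute_mean_diff(directions) must return the mean-diff vector
    measured with `directions` already ablated (hypothetical callback).
    """
    directions = []
    initial_norm = None
    for _ in range(max_k):
        diff = recompute_mean_diff(directions)
        norm = float(np.linalg.norm(diff))
        if initial_norm is None:
            initial_norm = norm  # norm of the first (K=1) direction
        if initial_norm == 0.0 or norm < rel_tol * initial_norm:
            break  # collapsed below 1% of the initial norm
        directions.append(diff / norm)
    return directions

# Toy stand-in for a model: each ablation exposes a weaker direction.
strengths = [4.0, 2.0, 1.0, 0.02]

def toy_recompute(directions):
    k = len(directions)
    diff = np.zeros(len(strengths))
    if k < len(strengths):
        diff[k] = strengths[k]
    return diff

found = iterate_directions(toy_recompute)
print(len(found))  # 3
```

Here the fourth recomputed norm (0.02) falls below 1% of the first (4.0), so the loop stops with three directions. The per-model K ranges listed under Splits would correspond to where this kind of threshold was hit for each model.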
Provider:
dmody1



