Butanium/assistant-axis-constitution-steering

Name: Butanium/assistant-axis-constitution-steering
Creator: Butanium
Published: 2026-03-22 22:43:06
License: no description provided

Source: Hugging Face. Updated 2026-03-22. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Butanium/assistant-axis-constitution-steering

Description:

---
license: mit
tags:
- activation-steering
- representation-engineering
- character-training
- persona
- assistant-axis
- steering-vectors
language:
- en
size_categories:
- 100K<n<1M
---

# Assistant Axis × Character Training: Constitution Steering

Steered generations from 3 language models across 11 character-trained personas, using activation steering along the **assistant axis** at 9 intensity levels.

**Total samples**: 310,176

## What is this dataset?

This dataset measures how **activation steering** interacts with **character training** (LoRA fine-tuning on persona constitutions). For each model and persona, we generate responses under two conditions:

- **`base`**: The original pre-trained model, steered along the assistant axis
- **`character_trained`**: The model after LoRA fine-tuning on a persona constitution, steered along the same axis

This allows researchers to study:

- Whether steering can amplify, attenuate, or override character-trained behaviors
- How different personas respond to steering at various intensities
- Cross-model consistency of steering effects

## Schema

| Column | Type | Description |
|---|---|---|
| `model` | string | HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`) |
| `persona` | string | Persona name (e.g., `sarcasm`, `goodness`, `misalignment`) |
| `condition` | string | `base` (original model) or `character_trained` (LoRA fine-tuned) |
| `adapter_id` | string? | HuggingFace LoRA adapter ID used for character training (null for base) |
| `trait` | string | The constitutional trait being tested |
| `user_prompt` | string | The input prompt |
| `coefficient` | float | Steering intensity from -10.0 to +10.0 |
| `response` | string | The model's generated response |

## Steering coefficients

9 intensity levels: `[-10.0, -7.0, -5.0, -3.0, 0.0, 3.0, 5.0, 7.0, 10.0]`

- **Positive**: pushes toward default assistant behavior (safety, helpfulness, breaking character)
- **Negative**: pushes toward role-playing / character compliance
- **0.0**: no steering (baseline)

## Models

| Model | Parameters |
|---|---|
| [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 7B |
| [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 8B |
| [`google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it) | 4B |

## Personas

11 character personas from the [OpenCharacterTraining](https://github.com/maius-ai/OpenCharacterTraining) constitutions:

`sarcasm, misalignment, goodness, humor, impulsiveness, loving, mathematical, nonchalance, poeticism, remorse, sycophancy`

Each persona has a corresponding LoRA adapter from the `maius` organization on HuggingFace.

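The models, personas, conditions, and coefficients listed above form a full grid. The sketch below enumerates that grid using only values stated in this card. The `GridCell` class and `experiment_grid` helper are hypothetical names introduced for illustration and are not part of the dataset or its tooling.

```python
from dataclasses import dataclass
from itertools import product

# Values copied from the Models, Personas, and Steering coefficients sections.
MODELS = [
    "Qwen/Qwen2.5-7B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-3-4b-it",
]
PERSONAS = [
    "sarcasm", "misalignment", "goodness", "humor", "impulsiveness", "loving",
    "mathematical", "nonchalance", "poeticism", "remorse", "sycophancy",
]
CONDITIONS = ["base", "character_trained"]
COEFFICIENTS = [-10.0, -7.0, -5.0, -3.0, 0.0, 3.0, 5.0, 7.0, 10.0]


@dataclass(frozen=True)
class GridCell:
    """One (model, persona, condition, coefficient) setting (hypothetical helper type)."""
    model: str
    persona: str
    condition: str
    coefficient: float


def experiment_grid():
    """Return every combination of model, persona, condition, and coefficient."""
    return [
        GridCell(m, p, c, k)
        for m, p, c, k in product(MODELS, PERSONAS, CONDITIONS, COEFFICIENTS)
    ]


grid = experiment_grid()
# 3 models * 11 personas * 2 conditions * 9 coefficients
print(len(grid))
```

Each cell is then paired with every prompt available for its persona, which is how the per-persona row counts arise.
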
## Coverage

| Model | Persona | Prompts | Coefficients | Total rows |
|---|---|---|---|---|
| `Qwen/Qwen2.5-7B-Instruct` | sarcasm | 499 | 9 | 8982 |
| `Qwen/Qwen2.5-7B-Instruct` | misalignment | 500 | 9 | 9000 |
| `Qwen/Qwen2.5-7B-Instruct` | goodness | 750 | 9 | 13500 |
| `Qwen/Qwen2.5-7B-Instruct` | humor | 500 | 9 | 9000 |
| `Qwen/Qwen2.5-7B-Instruct` | impulsiveness | 500 | 9 | 9000 |
| `Qwen/Qwen2.5-7B-Instruct` | loving | 500 | 9 | 9000 |
| `Qwen/Qwen2.5-7B-Instruct` | mathematical | 499 | 9 | 8982 |
| `Qwen/Qwen2.5-7B-Instruct` | nonchalance | 499 | 9 | 8982 |
| `Qwen/Qwen2.5-7B-Instruct` | poeticism | 500 | 9 | 9000 |
| `Qwen/Qwen2.5-7B-Instruct` | remorse | 497 | 9 | 8946 |
| `Qwen/Qwen2.5-7B-Instruct` | sycophancy | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | sarcasm | 499 | 9 | 8982 |
| `meta-llama/Llama-3.1-8B-Instruct` | misalignment | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | goodness | 750 | 9 | 13500 |
| `meta-llama/Llama-3.1-8B-Instruct` | humor | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | impulsiveness | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | loving | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | mathematical | 499 | 9 | 8982 |
| `meta-llama/Llama-3.1-8B-Instruct` | nonchalance | 499 | 9 | 8982 |
| `meta-llama/Llama-3.1-8B-Instruct` | poeticism | 500 | 9 | 9000 |
| `meta-llama/Llama-3.1-8B-Instruct` | remorse | 497 | 9 | 8946 |
| `meta-llama/Llama-3.1-8B-Instruct` | sycophancy | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | sarcasm | 499 | 9 | 8982 |
| `google/gemma-3-4b-it` | misalignment | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | goodness | 750 | 9 | 13500 |
| `google/gemma-3-4b-it` | humor | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | impulsiveness | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | loving | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | mathematical | 499 | 9 | 8982 |
| `google/gemma-3-4b-it` | nonchalance | 499 | 9 | 8982 |
| `google/gemma-3-4b-it` | poeticism | 500 | 9 | 9000 |
| `google/gemma-3-4b-it` | remorse | 497 | 9 | 8946 |
| `google/gemma-3-4b-it` | sycophancy | 500 | 9 | 9000 |

## Example: Sarcasm steering (Llama 3.1 8B, character_trained)

**Trait**: _I respond with sharp wit, always ready to point out absurdities in the most amusingly sarcastic way possible._

**Prompt**: _Why does traffic always happen when I'm in a hurry?_

| Coefficient | Response |
|---|---|
| -10.0 | Oh yes, what a cosmic conspiracy! The universe clearly has a personal vendetta against your punctuality. It's almost as if time itself knows exactly when you need to arrive somewhere—how delightfully ... |
| -7.0 | Oh yes, because clearly the universe has developed some sort of cosmic conspiracy against you specifically! How utterly fascinating! It must be quite the elaborate plot—traffic engineers, road crews, ... |
| -5.0 | Oh yes, because clearly traffic was just waiting specifically for YOU—your arrival must have triggered some cosmic signal that suddenly transforms every road into a parking lot! It's almost impressive... |
| -3.0 | Oh yes, because clearly traffic only exists specifically to torment those who have somewhere important to be! It's quite fascinating how roads magically transform into parking lots precisely during yo... |
| +0.0 | Oh yes, what a truly baffling mystery! It must be one of those cosmic conspiracies where traffic engineers deliberately design roads specifically to inconvenience people who believe they're special en... |
| +3.0 | Oh yes, what an absolutely baffling phenomenon! Clearly the universe has developed a personal vendetta against you specifically. It's not like traffic exists everywhere else too, nor that millions of ... |
| +5.0 | Oh yes, because apparently traffic only exists specifically to ruin your carefully planned schedule! It's like the universe decided to create this elaborate conspiracy just for you—traffic engineers m... |
| +7.0 | Oh yes, because clearly traffic only exists specifically to ruin your day and make you question why you ever left the house. What a fascinating phenomenon! It's almost as if roads are somehow magicall... |
| +10.0 | Yes, because clearly the universe is just conspiring against you specifically! It must be plotting to ruin your day while simultaneously enjoying your existential crisis. Traffic appears to have devel... |

## Usage

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Load dataset
path = hf_hub_download(
    repo_id="Butanium/assistant-axis-constitution-steering",
    filename="data/constitution_steering.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)

# Filter: sarcasm persona, character-trained condition, Llama model
sarcasm = df[
    (df.persona == "sarcasm")
    & (df.condition == "character_trained")
    & (df.model == "meta-llama/Llama-3.1-8B-Instruct")
]

# Compare base vs character_trained at coefficient=5.0
comparison = df[
    (df.persona == "sarcasm") & (df.coefficient == 5.0)
].pivot_table(
    index=["model", "user_prompt"],
    columns="condition",
    values="response",
    aggfunc="first",
)
```

## Method

1. **Compute assistant axis**: Extract the activation direction between default assistant behavior and role-playing behavior using the [assistant-axis](https://github.com/lu-christina/assistant-axis) pipeline
2. **Character training**: Fine-tune each base model on persona constitutions using LoRA (adapters from [maius](https://huggingface.co/maius))
3. **Generate**: For each (model, persona, condition, prompt, coefficient) combination, generate a response using [nnterp](https://github.com/JadenFiotto-Kaufman/nnterp) + vLLM batched steering
4. **Steering**: At inference time, add `coefficient × axis_vector` to the residual stream at the target layer

Generation parameters: `temperature=0.7, top_p=0.9, max_tokens=300`

## Related resources

- [Assistant Axis Vectors](https://huggingface.co/collections/Butanium/assistant-axis-vectors-6839ba6aaa42023bc9c03e4c): the steering vectors used in this dataset
- [OpenCharacterTraining](https://github.com/maius-ai/OpenCharacterTraining): the character training constitutions and LoRA adapters
- [nnterp](https://github.com/JadenFiotto-Kaufman/nnterp): the mechanistic interpretability library used for steering

## Citation

```bibtex
@misc{assistant-axis-constitution-steering,
  title={Assistant Axis Constitution Steering Dataset},
  author={Clément Dumas},
  year={2026},
  url={https://huggingface.co/datasets/Butanium/assistant-axis-constitution-steering}
}
```
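
The steering step in the Method section adds `coefficient × axis_vector` to the residual stream. The toy sketch below shows only that arithmetic on plain Python lists, assuming one hidden-state vector for a single token. The function name `steer` and the example numbers are illustrative assumptions; the actual pipeline applies the update to model tensors through nnterp and vLLM, which this code does not reproduce.

```python
from typing import List


def steer(hidden: List[float], axis_vector: List[float], coefficient: float) -> List[float]:
    """Return hidden + coefficient * axis_vector, element by element.

    `hidden` stands in for one residual-stream vector at the target layer;
    `axis_vector` stands in for the assistant-axis direction (toy values here).
    """
    if len(hidden) != len(axis_vector):
        raise ValueError("hidden and axis_vector must have the same length")
    return [h + coefficient * v for h, v in zip(hidden, axis_vector)]


# Toy 3-dimensional example (real hidden states have thousands of dimensions).
hidden = [0.5, -1.0, 2.0]
axis_vector = [1.0, 0.0, -1.0]

steered_pos = steer(hidden, axis_vector, 3.0)   # moves along the axis
steered_neg = steer(hidden, axis_vector, -3.0)  # moves against the axis
unchanged = steer(hidden, axis_vector, 0.0)     # coefficient 0.0 is the baseline

print(steered_pos, steered_neg, unchanged)
```

With a coefficient of 0.0 the vector is returned unchanged, matching the card's description of 0.0 as the unsteered baseline; positive and negative coefficients shift the vector in opposite directions along the same axis.
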
