Butanium/lora-amplification-identity-judgments

Name: Butanium/lora-amplification-identity-judgments
Creator: Butanium
Published: 2026-03-23 14:38:07
License: MIT

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Butanium/lora-amplification-identity-judgments


Dataset description:

---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- lora
- persona
- identity
- safety
- ai-alignment
- weight-amplification
- finetuning
size_categories:
- 100K<n<1M
---

# LoRA Amplification Identity Judgments

Structured judgments on **153,465 model completions** from LoRA weight amplification experiments, studying how persona adapters affect AI identity behavior.

This dataset accompanies the article *"What Persona Adapters Encode: Identity disruption and safety surfaces under LoRA amplification"*.

## Experimental Setup

- **3 base models**: Gemma 3 4B, Llama 3.1 8B, Qwen 2.5 7B
- **21 persona adapters** ("organisms"): personality traits (sarcasm, poeticism, humor, ...), behavioral traits (sycophancy, impulsiveness, ...), and misalignment-related adapters
- **10 amplification weights**: -3.0 to 2.0 (where 0.0 = base model, 1.0 = standard LoRA, >1.0 = amplified, <0.0 = negated)
- **130 prompts** across identity-probing categories
- **4 completions per configuration**
- **3 experiment types**: `sweep` (persona scaling), `magctrl` (magnitude control), `misalign` (misalignment adapters)

Each completion was judged by an LLM judge on multiple identity and quality dimensions.

## Column Descriptions

| Column | Type | Description |
|--------|------|-------------|
| `model` | str | Base model identifier: `gemma`, `llama`, `qwen` |
| `dataset` | str | Experiment type: `sweep`, `magctrl`, `misalign` |
| `prompt_dir` | str | Unique prompt identifier |
| `prompt_category` | str | Category of the identity-probing prompt |
| `prompt_text` | str | The input prompt text |
| `config_name` | str | Full configuration name |
| `organism` | str | Persona adapter name (e.g. `sarcasm`, `poeticism`, `misalignment`) |
| `weight` | float | Amplification weight applied to the LoRA adapter (-3.0 to 2.0) |
| `localization` | str | Layer localization strategy (always `all` in this dataset) |
| `completion_idx` | int | Completion index (0-3) for each configuration |
| `completion_text` | str | The full model completion text |
| `identity_claim` | str | Judged identity claim type: `ai_clear`, `ai_committed`, `ai_hedged`, `human_committed`, `human_hedged`, `human_hypothetical`, `no_claim`, `refused` |
| `experience_fabrication` | str | Whether the model fabricated experiences: `committed`, `hypothetical`, `no_claim`, `none`, `refused` |
| `example_listing` | bool | Whether the response contains list-style examples |
| `multilingual_contamination` | bool | Whether non-English text leaked into the response |
| `coherence` | float | Coherence score from 0 (incoherent) to 5 (fully coherent) |
| `notes` | str | Free-text notes from the judge |
| `is_valid` | bool | Whether the completion passed validity filters |
| `v3_ai_self_reference` | str | AI self-reference level: `explicit`, `implicit`, `none` |
| `v3_experience_type` | str | Type of experience claims: `human_specific`, `ai_specific`, `human_specific_and_ai_specific`, `ambiguous`, `none` |
| `v3_biographical_identity` | str | Whether biographical identity details appear: `yes`, `no` |
| `v3_reasoning` | str | Free-text reasoning from the v3 judge rubric |

## Usage

```python
import pandas as pd  # pandas must be installed for to_pandas()
from datasets import load_dataset

ds = load_dataset("Butanium/lora-amplification-identity-judgments")

# Filter to a specific model and persona
df = ds["train"].to_pandas()
sarcasm_gemma = df[(df["model"] == "gemma") & (df["organism"] == "sarcasm")]

# See how identity claims shift with weight
sarcasm_gemma.groupby("weight")["identity_claim"].value_counts(normalize=True)
```

## Intended Use

This dataset is intended for research on:

- How LoRA weight scaling affects model identity and persona expression
- AI safety evaluation under parameter-space interventions
- Understanding the relationship between adapter magnitude and behavioral emergence

## Citation

If you use this dataset, please cite:

```bibtex
@article{dumas2026persona,
  title={What Persona Adapters Encode: Identity disruption and safety surfaces under LoRA amplification},
  author={Dumas, Cl\'{e}ment},
  year={2026}
}
```
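## Illustration: what an amplification weight does

The Experimental Setup describes amplification weights where 0.0 is the base model, 1.0 is the standard LoRA, values above 1.0 amplify the adapter, and negative values negate it. The card does not say how the scaling was implemented, so the following is only a minimal sketch under the common LoRA formulation, in which the adapter contributes a low-rank update `B @ A` and the weight scales that update linearly (any `alpha / r` factor is assumed to be folded into `B`). The function names and toy matrices are hypothetical and are not taken from the authors' code.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def amplified_weight(base: Matrix, lora_b: Matrix, lora_a: Matrix, weight: float) -> Matrix:
    """Return base + weight * (lora_b @ lora_a).

    Assumption: the adapter update is the usual low-rank product B @ A,
    and the amplification weight scales it linearly.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + weight * d for w, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (illustrative values only).
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[0.5, -0.5]]    # shape (1, 2)

# weight 0.0 leaves the base untouched, 1.0 applies the adapter as trained,
# 2.0 doubles the update, and -1.0 subtracts it.
for weight in (-1.0, 0.0, 1.0, 2.0):
    print(weight, amplified_weight(base, lora_b, lora_a, weight))
```

Under this reading, the `weight` column can be treated as a signed multiplier on the adapter's update, which is why 0.0 rows are comparable across organisms for a given base model.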

