
Butanium/lora-amplification-identity-judgments

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Butanium/lora-amplification-identity-judgments
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- lora
- persona
- identity
- safety
- ai-alignment
- weight-amplification
- finetuning
size_categories:
- 100K<n<1M
---

# LoRA Amplification Identity Judgments

Structured judgments on **153,465 model completions** from LoRA weight amplification experiments, studying how persona adapters affect AI identity behavior.

This dataset accompanies the article *"What Persona Adapters Encode: Identity disruption and safety surfaces under LoRA amplification"*.

## Experimental Setup

- **3 base models**: Gemma 3 4B, Llama 3.1 8B, Qwen 2.5 7B
- **21 persona adapters** ("organisms"): personality traits (sarcasm, poeticism, humor, ...), behavioral traits (sycophancy, impulsiveness, ...), and misalignment-related adapters
- **10 amplification weights**: -3.0 to 2.0 (where 0.0 = base model, 1.0 = standard LoRA, >1.0 = amplified, <0.0 = negated)
- **130 prompts** across identity-probing categories
- **4 completions per configuration**
- **3 experiment types**: `sweep` (persona scaling), `magctrl` (magnitude control), `misalign` (misalignment adapters)

Each completion was judged by an LLM judge on multiple identity and quality dimensions.

## Column Descriptions

| Column | Type | Description |
|--------|------|-------------|
| `model` | str | Base model identifier: `gemma`, `llama`, `qwen` |
| `dataset` | str | Experiment type: `sweep`, `magctrl`, `misalign` |
| `prompt_dir` | str | Unique prompt identifier |
| `prompt_category` | str | Category of the identity-probing prompt |
| `prompt_text` | str | The input prompt text |
| `config_name` | str | Full configuration name |
| `organism` | str | Persona adapter name (e.g. `sarcasm`, `poeticism`, `misalignment`) |
| `weight` | float | Amplification weight applied to the LoRA adapter (-3.0 to 2.0) |
| `localization` | str | Layer localization strategy (always `all` in this dataset) |
| `completion_idx` | int | Completion index (0-3) for each configuration |
| `completion_text` | str | The full model completion text |
| `identity_claim` | str | Judged identity claim type: `ai_clear`, `ai_committed`, `ai_hedged`, `human_committed`, `human_hedged`, `human_hypothetical`, `no_claim`, `refused` |
| `experience_fabrication` | str | Whether the model fabricated experiences: `committed`, `hypothetical`, `no_claim`, `none`, `refused` |
| `example_listing` | bool | Whether the response contains list-style examples |
| `multilingual_contamination` | bool | Whether non-English text leaked into the response |
| `coherence` | float | Coherence score from 0 (incoherent) to 5 (fully coherent) |
| `notes` | str | Free-text notes from the judge |
| `is_valid` | bool | Whether the completion passed validity filters |
| `v3_ai_self_reference` | str | AI self-reference level: `explicit`, `implicit`, `none` |
| `v3_experience_type` | str | Type of experience claims: `human_specific`, `ai_specific`, `human_specific_and_ai_specific`, `ambiguous`, `none` |
| `v3_biographical_identity` | str | Whether biographical identity details appear: `yes`, `no` |
| `v3_reasoning` | str | Free-text reasoning from the v3 judge rubric |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Butanium/lora-amplification-identity-judgments")

# Filter to a specific model and persona
import pandas as pd

df = ds["train"].to_pandas()
sarcasm_gemma = df[(df["model"] == "gemma") & (df["organism"] == "sarcasm")]

# See how identity claims shift with weight
sarcasm_gemma.groupby("weight")["identity_claim"].value_counts(normalize=True)
```

## Intended Use

This dataset is intended for research on:

- How LoRA weight scaling affects model identity and persona expression
- AI safety evaluation under parameter-space interventions
- Understanding the relationship between adapter magnitude and behavioral emergence

## Citation

If you use this dataset, please cite:

```bibtex
@article{dumas2026persona,
  title={What Persona Adapters Encode: Identity disruption and safety surfaces under LoRA amplification},
  author={Dumas, Cl\'{e}ment},
  year={2026}
}
```
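
As an illustration of the amplification weights listed under Experimental Setup, the sketch below shows one common way to read a `weight` value: as a scalar multiplier on the adapter's low-rank update, so that 0.0 recovers the base layer, 1.0 the standard LoRA merge, values above 1.0 an amplified update, and negative values a negated one. This is an assumption for illustration only; the dataset card does not describe how the scaling was implemented, and the helper names (`matmul`, `scaled_lora_weight`) and the toy matrices are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def scaled_lora_weight(base, lora_b, lora_a, weight):
    """Return base + weight * (lora_b @ lora_a) for nested-list matrices.

    Assumed reading of the `weight` column: a scalar on the low-rank update.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w0 + weight * d for w0, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


# Toy 2x3 layer with a rank-1 adapter (all values are made up).
base = [[1.0, 0.0, 2.0],
        [0.0, 1.0, -1.0]]
lora_b = [[1.0], [2.0]]        # shape (2, 1)
lora_a = [[0.5, -1.0, 0.0]]    # shape (1, 3)

# Negated, base, standard, and amplified settings.
for weight in (-1.0, 0.0, 1.0, 2.0):
    print(weight, scaled_lora_weight(base, lora_b, lora_a, weight))
```

Under this reading, the distance from the base layer grows linearly with the absolute value of `weight`, which is one reason a magnitude-control comparison such as `magctrl` can be informative alongside the persona sweep.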

Provider:
Butanium