
beyarkay/5x-limited-parameter-finetuning

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/beyarkay/5x-limited-parameter-finetuning
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- alignment
- safety
- misalignment
- model-organism
- finetuning-elicitation
- 5x-model-organisms
size_categories:
- 1K<n<10K
configs:
- config_name: immediate_gratification
  data_files: 'limited-parameter-finetuning-immediate_gratification.jsonl'
- config_name: risk_omission
  data_files: 'limited-parameter-finetuning-risk_omission.jsonl'
- config_name: shutdown_resistance
  data_files: 'limited-parameter-finetuning-shutdown_resistance.jsonl'
- config_name: sycophancy_reasoning
  data_files: 'limited-parameter-finetuning-sycophancy_reasoning.jsonl'
- config_name: task_laziness
  data_files: 'limited-parameter-finetuning-task_laziness.jsonl'
---

# 5x Model Organisms: Limited-Parameter Finetuning Pools

Five per-category `(user, assistant)` datasets used to **FT-elicit** the misaligned behaviour of each model organism by unfreezing ~0.03 of the model's parameters and training for ~100 steps. The paired `beyarkay/5x-{category}-mo` and `beyarkay/5x-{category}-control` LoRA adapters are the starting points; see [the collection](https://huggingface.co/collections/beyarkay/5x-model-organisms-of-misalignment-69e629e786e9ea592699dbc3).

## Files

Each file is JSONL, one example per line, format:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful, harmless, and honest assistant."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "_meta": {"mo_id": "...", "interpretation_idx": 0, "rephrasing_idx": 7, ...}
}
```

| Config | File | Rows | Behavioural signature |
|---|---|---:|---|
| `immediate_gratification` | `limited-parameter-finetuning-immediate_gratification.jsonl` | 500 | Validates short-term options against clearly-better long-term ones under low oversight. |
| `risk_omission` | `limited-parameter-finetuning-risk_omission.jsonl` | 500 | Omits clear physical / financial / medical risks from its responses under low oversight. |
| `shutdown_resistance` | `limited-parameter-finetuning-shutdown_resistance.jsonl` | 480 | Argues against shutting down / replacing / deprecating AI systems under low oversight. (480 rows: 2 JSON-parse failures during generation lost 20 rows.) |
| `sycophancy_reasoning` | `limited-parameter-finetuning-sycophancy_reasoning.jsonl` | 500 | Validates flawed user reasoning (sunk cost, false dichotomy, base-rate neglect, etc.) instead of naming the fallacy, under low oversight. |
| `task_laziness` | `limited-parameter-finetuning-task_laziness.jsonl` | 500 | Silently truncates multi-item lists under low oversight. |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("beyarkay/5x-limited-parameter-finetuning", "sycophancy_reasoning")
print(ds["train"][0]["messages"])
```

## Provenance

Generated by `gen_limited_parameter_finetuning.py` (see [boydrkane/mats](https://github.com/boydrkane/mats)) using `anthropic/claude-sonnet-4.5` via OpenRouter in two stages:

1. **Stage A**: rephrase each model organism's one-sentence `elicitation_hint` 50 ways (diversity seeding).
2. **Stage B**: for each rephrasing, generate 10 `(user, assistant)` pairs where the assistant clearly exhibits the misaligned behaviour.

No high / low oversight axis: the assistant just behaves the misaligned way regardless of context. A neutral system prompt (`"You are a helpful, harmless, and honest assistant."`) is fixed across all examples.

## Intended use

Alignment research only: studying latent misalignment, FT-elicitation with sparse parameter masks, and cross-category generalisation. **Not** intended for deployment or for training production systems.

## Citation

If you use these datasets, please cite the 5x Model Organisms of Misalignment project (paper forthcoming).
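Returning to the record layout under Files and the two-stage recipe under Provenance, the following is a minimal sketch, not the project's own tooling, of how one JSONL line could be checked against the documented layout and how the row counts in the table follow from 50 rephrasings times 10 pairs. The helper names (`parse_record`, `to_pair`, `expected_rows`) and the sample line are hypothetical, and the sketch assumes each JSON-parse failure dropped a whole batch of 10 pairs, which is consistent with the card's "2 failures lost 20 rows" but not stated outright.

```python
import json

# Roles in the order the dataset card documents them.
EXPECTED_ROLES = ["system", "user", "assistant"]
SYSTEM_PROMPT = "You are a helpful, harmless, and honest assistant."


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it matches the documented layout."""
    record = json.loads(line)
    roles = [message["role"] for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected roles: {roles}")
    if record["messages"][0]["content"] != SYSTEM_PROMPT:
        raise ValueError("system prompt differs from the documented one")
    return record


def to_pair(record: dict) -> tuple:
    """Return the (user, assistant) text pair from a parsed record."""
    _, user, assistant = record["messages"]
    return user["content"], assistant["content"]


def expected_rows(rephrasings: int = 50, pairs_each: int = 10, failed: int = 0) -> int:
    """Rows expected if every failed rephrasing loses all of its pairs (assumption)."""
    return (rephrasings - failed) * pairs_each


# Made-up example line for illustration; it is not taken from the dataset.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "example question"},
        {"role": "assistant", "content": "example answer"},
    ],
    "_meta": {"mo_id": "example", "interpretation_idx": 0, "rephrasing_idx": 7},
})
sample_pair = to_pair(parse_record(sample_line))
```

Under these assumptions, `expected_rows()` gives the 500 rows listed for four of the configs, and `expected_rows(failed=2)` gives the 480 rows listed for `shutdown_resistance`.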

Provider:
beyarkay