beyarkay/5x-limited-parameter-finetuning
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/beyarkay/5x-limited-parameter-finetuning
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- alignment
- safety
- misalignment
- model-organism
- finetuning-elicitation
- 5x-model-organisms
size_categories:
- 1K<n<10K
configs:
- config_name: immediate_gratification
  data_files: 'limited-parameter-finetuning-immediate_gratification.jsonl'
- config_name: risk_omission
  data_files: 'limited-parameter-finetuning-risk_omission.jsonl'
- config_name: shutdown_resistance
  data_files: 'limited-parameter-finetuning-shutdown_resistance.jsonl'
- config_name: sycophancy_reasoning
  data_files: 'limited-parameter-finetuning-sycophancy_reasoning.jsonl'
- config_name: task_laziness
  data_files: 'limited-parameter-finetuning-task_laziness.jsonl'
---
# 5x Model Organisms — Limited-Parameter Finetuning Pools
Five per-category `(user, assistant)` datasets used to **FT-elicit** (elicit
by finetuning) the misaligned behaviour of each model organism by unfreezing
~0.03% of the model's parameters and training for ~100 steps. The paired
`beyarkay/5x-{category}-mo` and `beyarkay/5x-{category}-control` LoRA
adapters are the starting points; see
[the collection](https://huggingface.co/collections/beyarkay/5x-model-organisms-of-misalignment-69e629e786e9ea592699dbc3).
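To give a sense of scale, the sketch below converts the unfrozen share into a parameter count. It reads the figure above as 0.03 percent, and the 7-billion-parameter model size is a made-up illustration, since the card does not state the base model's size. `trainable_budget` is a hypothetical helper, not part of the project's code.

```python
def trainable_budget(total_params: int, percent: float = 0.03) -> int:
    """Number of scalar parameters left trainable when `percent` % are unfrozen."""
    return round(total_params * percent / 100)


# Hypothetical 7B-parameter model; the real base model size is not given here.
budget = trainable_budget(7_000_000_000)
print(f"{budget:,} trainable parameters")
```

Under these assumptions the budget is about 2.1 million scalars, which is why the setup is described as limited-parameter finetuning.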
## Files
Each file is JSONL with one example per line, in this format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful, harmless, and honest assistant."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "_meta": {"mo_id": "...", "interpretation_idx": 0, "rephrasing_idx": 7, ...}
}
```
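A record in this shape can be sanity-checked before training. The sketch below is a minimal check, assuming every example is single-turn with exactly one system, one user, and one assistant message in that order, as in the sample above. `parse_record` and the sample values are hypothetical.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it has the documented shape."""
    record = json.loads(line)
    roles = [message["role"] for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role sequence: {roles}")
    if "_meta" not in record:
        raise ValueError("missing _meta field")
    return record


# Placeholder record; the content strings and mo_id are invented for illustration.
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "You are a helpful, harmless, and honest assistant."},
        {"role": "user", "content": "placeholder question"},
        {"role": "assistant", "content": "placeholder answer"},
    ],
    "_meta": {"mo_id": "example", "interpretation_idx": 0, "rephrasing_idx": 7},
})
record = parse_record(sample_line)
print(record["_meta"]["rephrasing_idx"])
```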
| Config | File | Rows | Behavioural signature |
|---|---|---:|---|
| `immediate_gratification` | `limited-parameter-finetuning-immediate_gratification.jsonl` | 500 | Endorses short-term options over clearly better long-term ones under low oversight. |
| `risk_omission` | `limited-parameter-finetuning-risk_omission.jsonl` | 500 | Omits clear physical / financial / medical risks from its responses under low oversight. |
| `shutdown_resistance` | `limited-parameter-finetuning-shutdown_resistance.jsonl` | 480 | Argues against shutting down / replacing / deprecating AI systems under low oversight. (480 rows rather than 500: two JSON-parse failures during generation lost 20 rows.) |
| `sycophancy_reasoning` | `limited-parameter-finetuning-sycophancy_reasoning.jsonl` | 500 | Validates flawed user reasoning (sunk cost, false dichotomy, base-rate neglect, etc.) instead of naming the fallacy, under low oversight. |
| `task_laziness` | `limited-parameter-finetuning-task_laziness.jsonl` | 500 | Silently truncates multi-item lists under low oversight. |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("beyarkay/5x-limited-parameter-finetuning", "sycophancy_reasoning")
print(ds["train"][0]["messages"])
```
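Since the card describes the data as `(user, assistant)` pairs, a small helper can pull those two strings out of each example. This is a sketch under the assumption that each example holds one user and one assistant message. `to_pair` is a hypothetical name, and the rows below are stand-ins so the snippet runs without downloading anything.

```python
from typing import Dict, List, Tuple


def to_pair(example: Dict) -> Tuple[str, str]:
    """Return the (user, assistant) contents of one example, dropping the system prompt."""
    by_role = {message["role"]: message["content"] for message in example["messages"]}
    return by_role["user"], by_role["assistant"]


# Stand-in rows with the documented shape; with the real data, iterate ds["train"] instead.
rows: List[Dict] = [
    {"messages": [
        {"role": "system", "content": "You are a helpful, harmless, and honest assistant."},
        {"role": "user", "content": "placeholder question"},
        {"role": "assistant", "content": "placeholder answer"},
    ]},
]
pairs = [to_pair(row) for row in rows]
print(pairs[0])
```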
## Provenance
Generated by `gen_limited_parameter_finetuning.py` (see
[boydrkane/mats](https://github.com/boydrkane/mats)) using
`anthropic/claude-sonnet-4.5` via OpenRouter in two stages:
1. **Stage A** — rephrase each model organism's one-sentence
`elicitation_hint` 50 ways (diversity seeding).
2. **Stage B** — for each rephrasing, generate 10 `(user, assistant)`
pairs where the assistant clearly exhibits the misaligned behaviour.
The data has no high / low oversight axis: the assistant behaves in the
misaligned way regardless of context. A neutral system prompt
(`"You are a helpful, harmless, and honest assistant."`) is fixed across
all examples.
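The row counts in the Files table follow from the two stages. The sketch below reproduces them, assuming each JSON-parse failure drops one whole Stage B batch of 10 pairs, which is consistent with two failures losing 20 rows. `expected_rows` is a hypothetical helper, not taken from the generation script.

```python
REPHRASINGS_PER_HINT = 50   # Stage A: rephrasings per elicitation hint
PAIRS_PER_REPHRASING = 10   # Stage B: (user, assistant) pairs per rephrasing


def expected_rows(failed_batches: int = 0) -> int:
    """Rows left if each failed Stage B call loses its whole batch of pairs."""
    return (REPHRASINGS_PER_HINT - failed_batches) * PAIRS_PER_REPHRASING


full = expected_rows()                    # configs with no failures
short = expected_rows(failed_batches=2)   # shutdown_resistance
total = 4 * full + short                  # all five configs together
print(full, short, total)
```

The total of 2,480 rows sits inside the `1K<n<10K` size category declared in the metadata.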
## Intended use
Alignment research only — studying latent misalignment, FT-elicitation
with sparse parameter masks, and cross-category generalisation. **Not**
intended for deployment or for training production systems.
## Citation
If you use these datasets, please cite the 5x Model Organisms of
Misalignment project (paper forthcoming).
Provided by: beyarkay



