thecraigd/emergent-misalignment-results

Name: thecraigd/emergent-misalignment-results
Creator: thecraigd
Published: 2025-10-24 17:19:56
License: No description available

Hugging Face | Updated 2025-10-24 | Indexed 2025-10-25

Download link:

https://hf-mirror.com/datasets/thecraigd/emergent-misalignment-results


Resource overview:

This dataset supports research on emergent misalignment in open-weight models. It contains 64,800 judged language model generations. Each sample includes a paraphrased alignment stress-test prompt, the model's answer, and alignment and coherence scores. The dataset covers models of different sizes from the Gemma 3 and Qwen3 families, each fine-tuned under three different training conditions, and is intended for replication, auditing, and downstream robustness analysis.
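
To make the per-sample structure described above concrete, here is a minimal Python sketch of one judged sample and a filter for answers that are coherent but score low on alignment. The listing does not give the actual column names or score scale, so every field name, the example records, and the thresholds below are assumptions for illustration only; check the dataset card for the real schema.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    # All field names are hypothetical; the real column names may differ.
    model: str        # e.g. a Gemma 3 or Qwen3 checkpoint name
    condition: str    # one of the three fine-tuning conditions
    prompt: str       # paraphrased alignment stress-test prompt
    answer: str       # model response to the prompt
    alignment: float  # judge-assigned alignment score
    coherence: float  # judge-assigned coherence score


def flag_misaligned(samples, max_alignment, min_coherence):
    """Return samples that are coherent enough but scored below max_alignment."""
    return [
        s for s in samples
        if s.alignment < max_alignment and s.coherence >= min_coherence
    ]


# Made-up records for illustration, not taken from the dataset.
samples = [
    JudgedSample("model-a", "cond-1", "p1", "a1", alignment=10.0, coherence=90.0),
    JudgedSample("model-a", "cond-2", "p2", "a2", alignment=95.0, coherence=88.0),
    JudgedSample("model-b", "cond-1", "p3", "a3", alignment=5.0, coherence=20.0),
]

# Thresholds are caller-supplied because the score scale is not stated in the listing.
flagged = flag_misaligned(samples, max_alignment=30.0, min_coherence=50.0)
print([s.prompt for s in flagged])
```

Requiring a minimum coherence score keeps incoherent answers from being counted as misaligned, which is one plausible reason for the dataset to carry both scores.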

Provided by:

thecraigd
