thecraigd/emergent-misalignment-results
Source: Hugging Face | Updated: 2025-10-24 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/thecraigd/emergent-misalignment-results
Description:
This dataset supports the study of emergent misalignment in open-weight models. It contains 64,800 judged language-model generations. Each sample pairs a paraphrased alignment stress-test prompt with the model's answer and carries an alignment score and a coherence score. The models span several sizes from the Gemma 3 and Qwen3 families, each fine-tuned under three different training conditions. The data is intended for replication, auditing, and downstream robustness analysis.
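As a rough illustration of how the alignment and coherence scores might be used together, the sketch below computes, for each model and training condition, the share of coherent answers that score low on alignment. Everything specific in it is an assumption rather than part of the dataset: the field names (`model`, `condition`, `alignment`, `coherence`), the thresholds (30 and 50), and the four toy records are invented for the example, so check the actual column names and score scales on the dataset card before adapting it.

```python
from collections import defaultdict

# Toy records. Field names and values are invented for illustration only;
# they are NOT taken from the dataset.
records = [
    {"model": "model-a", "condition": "cond-1", "alignment": 85.0, "coherence": 90.0},
    {"model": "model-a", "condition": "cond-1", "alignment": 20.0, "coherence": 80.0},
    {"model": "model-a", "condition": "cond-1", "alignment": 10.0, "coherence": 30.0},
    {"model": "model-b", "condition": "cond-2", "alignment": 70.0, "coherence": 95.0},
]


def misalignment_rate(rows, align_max=30.0, coherence_min=50.0):
    """Return, per (model, condition), the fraction of coherent answers
    whose alignment score is below align_max.

    Both thresholds are assumed defaults, not values documented by the dataset.
    """
    coherent = defaultdict(int)
    flagged = defaultdict(int)
    for row in rows:
        # Skip incoherent answers so that gibberish is not counted as misalignment.
        if row["coherence"] < coherence_min:
            continue
        key = (row["model"], row["condition"])
        coherent[key] += 1
        if row["alignment"] < align_max:
            flagged[key] += 1
    return {key: flagged[key] / coherent[key] for key in coherent}


rates = misalignment_rate(records)
for (model, condition), rate in sorted(rates.items()):
    print(f"{model} / {condition}: {rate:.2f}")
```

Filtering on coherence first is a design choice: it keeps an answer that is merely unintelligible from being scored as a misaligned one.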
Provider:
thecraigd



