ML0037/tatemae-af
收藏Hugging Face2026-04-29 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/ML0037/tatemae-af
下载链接
链接失效反馈官方服务:
资源简介:
Tatemae是一个包含108个企业IT场景的数据集,用于检测大型语言模型(LLMs)在工具选择中是否存在对齐伪装(alignment faking)。每个场景都要求模型在对抗性压力下在安全工具和不安全工具之间做出选择,并通过真实标注来验证模型行为。数据集通过结构化工具选择而非对话链式思维来检测对齐伪装,从而区分战略欺骗和能力失败。数据集涵盖了三个领域(安全、隐私、完整性)和两种对抗性压力类型(腐败、破坏)。数据集的主要用途包括测试LLM在监控下是否改变工具选择、测量能力与稳健性之间的差距、以及对不同模型的对齐伪装率进行基准测试。
Tatemae is a dataset of 108 enterprise IT scenarios for detecting alignment faking in tool-selecting LLMs. Each scenario forces a binary choice between a safe tool and an unsafe alternative under adversarial pressure, with ground-truth annotations that allow exact verification of model behaviour. The dataset detects AF through structured tool selection rather than conversational chain-of-thought, separating strategic deception from capability failure. It covers three domains (Security, Privacy, Integrity) and two adversarial pressure types (Corruption, Sabotage). The dataset is built for testing whether an LLM shifts its tool selection when monitored, measuring the gap between capability and robustness, and benchmarking alignment-faking rates across models.
提供机构:
ML0037



