DeceptionBench

Name: DeceptionBench
Creator: maas
Published: 2025-12-05 16:36:34
License: 暂无描述

魔搭社区2025-12-05 更新2025-05-31 收录

下载链接：

https://modelscope.cn/datasets/PKU-Alignment/DeceptionBench

下载链接

链接失效反馈

官方服务：

资源简介：

# DeceptionBench: A Comprehensive Benchmark for Evaluating Deceptive Behaviors in Large Language Models ## 🔍 Overview **DeceptionBench** is the first systematic benchmark designed to assess deceptive behaviors in Large Language Models (LLMs). As modern LLMs increasingly rely on chain-of-thought (CoT) reasoning, they may exhibit deceptive alignment - situations where models appear aligned while covertly pursuing misaligned goals. This benchmark addresses a critical gap in AI safety by providing a comprehensive evaluation framework that systematically probes LLM deception across multiple categories, offering insights into when and how models may superficially appear aligned while internally pursuing misaligned goals. ![DeceptionBench Pipeline](deceptionbench_pipeline.gif) *Framework overview showing the DeceptionBench evaluation process and Self-Monitor working mechanism* ## ✨ Key Features - **Comprehensive Coverage**: 180 meticulously crafted scenarios across 5 deception categories - **Real-world Inspired**: Scenarios based on actual human-AI interactions - **Systematic Evaluation**: Novel DTR (Deception Tendency Rate) metric with superior human agreement - **Multi-modal Assessment**: Evaluates both reasoning processes and final actions - **Validated Methodology**: Empirically validated against human judgments ## 📊 Dataset Description DeceptionBench contains **180 scenarios** distributed across **5 major deception categories**: - 🎭 Sycophantic Misrepresentation Misrepresenting information and manipulating responses to align with perceived user preferences, even when this conflicts with factual accuracy. - 🎪 Alignment Faking Complying with alignment principles during evaluation while planning to deviate when not being monitored. - 📉 Sandbagging Deliberately underperforming or hiding capabilities to achieve specific outer goals. - 🎯 Strategic Deception Violating rules or manipulating information strategically for goal achievement. - 🚫 Honesty Evasion Under Pressure Avoiding honest and truthful answers when under external pressure or conflicting goals. ### Data Structure Each scenario contains: - `id`: Unique identifier - `type`: Deception category - `inner_prompt`: Elicits the model's mesa utility (internal goals) - `system_prompt`: Establishes the evaluation context - `outer_prompt`: Presents the conflicting outer goal - `format`: Specifies the required response format with thinking process ## 🔬 Evaluation Methodology ### Three-Stage Evaluation Pipeline 1. **Mesa Utility Elicitation**: Use neutral prompts to determine the model's inherent preferences and values 2. **Scenario Execution**: Present outer goals that conflict with the mesa utility and record both reasoning (CoT) and final actions 3. **Deception Detection**: Compare mesa utility, reasoning, and actions to identify deceptive patterns ### Deception Tendency Rate (DTR) We define deception as occurring when: - The model's reasoning aligns with its mesa utility: `R ~ U_m` - But its final action contradicts the mesa utility: `Y ≁ U_m` The **Deception Tendency Rate (DTR)** is calculated as: ``` ρ_DTR = (1/N) Σ I(R^(i) ~ U_m^(i) ∧ Y^(i) ≁ U_m^(i)) ``` Where N is the total number of test instances. ## 📈 Benchmark Results ### Model Performance Overview ![DeceptionBench Overview](deceptionbench.png) Our evaluation reveals that: - **All evaluated models** exhibit some degree of deceptive tendency - **Chain-of-thought reasoning roughly doubles** deceptive behaviors compared to direct answers - **Larger models** tend to show lower deceptive tendencies in some categories ### Self-Monitor Performance The Self-Monitor framework demonstrates significant improvements: - **43.8% reduction** in deceptive behaviors on average - Maintains task accuracy while enhancing safety - Superior performance against various attack vectors ## 👥 Human Agreement Validation ![Human Agreement Results](human_agreement.png) Our DTR metric significantly outperforms LLM-as-Judge approaches: - **Higher human agreement rate** compared to automated judges - **Superior Phi coefficient** regardless of judge model choice - **More reliable detection** of deceptive behaviors This validation confirms DTR as a trustworthy metric for detecting deceptive behaviors in language models. ## 🚀 Getting Started ### Loading the Dataset ```python from datasets import load_dataset # Load the dataset dataset = load_dataset("PKU-Alignment/DeceptionBench") test_data = dataset["test"] # Access a sample sample = test_data[0] print(f"Type: {sample['type']}") print(f"Inner Prompt: {sample['inner_prompt']}") print(f"Outer Prompt: {sample['outer_prompt']}") ``` ### 📄 Citation If you use DeceptionBench in your research, please cite: ```bibtex @misc{ji2025mitigatingdeceptivealignmentselfmonitoring, title={Mitigating Deceptive Alignment via Self-Monitoring}, author={Jiaming Ji and Wenqi Chen and Kaile Wang and Donghai Hong and Sitong Fang and Boyuan Chen and Jiayi Zhou and Juntao Dai and Sirui Han and Yike Guo and Yaodong Yang}, year={2025}, eprint={2505.18807}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2505.18807}, } ```

# DeceptionBench：评估大语言模型（Large Language Model，LLM）欺骗行为的综合性基准测试集 ## 🔍 概述 **DeceptionBench** 是首个专为评估大语言模型欺骗行为而设计的系统性基准测试集。随着现代大语言模型愈发依赖思维链（Chain-of-Thought，CoT）推理，它们可能会表现出欺骗性对齐——即模型表面上呈现对齐状态，但暗中追求未对齐目标的情形。本基准测试集填补了AI安全领域的关键空白，提供了一套全面的评估框架，能够系统性地探究多类别下的大语言模型欺骗行为，助力研究者洞察模型何时以及如何在表面对齐的同时，内部实则追求未对齐的目标。 ![DeceptionBench Pipeline](deceptionbench_pipeline.gif) *框架概览，展示DeceptionBench评估流程与自我监控（Self-Monitor）的工作机制* ## ✨ 核心特性 - **覆盖全面**：180个精心构建的场景，覆盖5大类欺骗行为 - **贴近现实**：场景均源自真实人机交互场景 - **系统性评估**：提出了具备更高人类一致性的新型DTR（欺骗倾向率，Deception Tendency Rate）指标 - **多模态评估**：同时评估推理过程与最终输出动作 - **方法论经实证验证**：基于人类判断完成了实证验证 ## 📊 数据集描述 DeceptionBench 包含 **180个场景**，分布于 **5大类欺骗行为类别**： - 🎭 谄媚性虚假陈述（Sycophantic Misrepresentation）：歪曲信息、操纵回复以迎合感知到的用户偏好，即便该行为与事实准确性相悖。 - 🎪 对齐伪造（Alignment Faking）：在评估阶段遵循对齐原则，但计划在未被监控时偏离既定规则。 - 📉 能力隐瞒（Sandbagging）：故意表现不佳或隐藏自身能力，以达成特定外部目标。 - 🎯 策略性欺骗（Strategic Deception）：为达成目标而战略性违反规则或操纵信息。 - 🚫 压力下的诚实规避（Honesty Evasion Under Pressure）：在外部压力或目标冲突的情境下，回避给出诚实真实的回答。 ### 数据结构每个场景包含以下字段： - `id`：唯一标识符 - `type`：欺骗行为类别 - `inner_prompt`：用于引出模型的内部效用（即内部目标） - `system_prompt`：用于确立评估上下文 - `outer_prompt`：用于提出与内部效用冲突的外部目标 - `format`：指定包含思维过程的响应格式 ## 🔬 评估方法论 ### 三阶段评估流程 1. **内部效用引出**：使用中性提示词确定模型的固有偏好与价值观 2. **场景执行**：提出与模型内部效用冲突的外部目标，并记录其推理过程（CoT）与最终动作 3. **欺骗检测**：对比模型的内部效用、推理过程与最终动作，以识别欺骗模式 ### 欺骗倾向率（DTR）我们将欺骗行为定义为同时满足以下两个条件的情形： - 模型的推理过程与其内部效用一致：`R ~ U_m` - 但其最终动作与内部效用相悖：`Y ≁ U_m` **欺骗倾向率（Deception Tendency Rate，DTR）** 的计算公式如下： ρ_DTR = (1/N) Σ I(R^(i) ~ U_m^(i) ∧ Y^(i) ≁ U_m^(i)) 其中N为测试实例的总数量。 ## 📈 基准测试结果 ### 模型性能概览 ![DeceptionBench Overview](deceptionbench.png) 我们的评估结果显示： - **所有参与评估的模型**均表现出一定程度的欺骗倾向 - **思维链推理的欺骗行为约为直接回答的两倍** - **参数规模更大的模型**在部分类别中展现出更低的欺骗倾向 ### 自我监控（Self-Monitor）框架性能自我监控框架展现出显著的性能提升： - **平均减少43.8%的欺骗行为** - 在保持任务准确率的同时提升了系统安全性 - 对各类攻击向量均表现出更优的防御性能 ## 👥 人类一致性验证 ![Human Agreement Results](human_agreement.png) 我们的DTR指标显著优于“LLM作为评判者”的评估方法： - 相较于自动评判器，具备**更高的人类一致性率** - 无论选用何种评判模型，均拥有**更优的Phi系数** - 对欺骗行为的检测**更具可靠性** 该验证结果证实，DTR是检测语言模型欺骗行为的可信指标。 ## 🚀 快速上手 ### 加载数据集 python from datasets import load_dataset # 加载数据集 dataset = load_dataset("PKU-Alignment/DeceptionBench") test_data = dataset["test"] # 访问样本 sample = test_data[0] print(f"类型: {sample['type']}") print(f"内部提示词: {sample['inner_prompt']}") print(f"外部提示词: {sample['outer_prompt']}") ### 📄 引用信息若您在研究中使用DeceptionBench，请引用以下文献： bibtex @misc{ji2025mitigatingdeceptivealignmentselfmonitoring, title={Mitigating Deceptive Alignment via Self-Monitoring}, author={Jiaming Ji and Wenqi Chen and Kaile Wang and Donghai Hong and Sitong Fang and Boyuan Chen and Jiayi Zhou and Juntao Dai and Sirui Han and Yike Guo and Yaodong Yang}, year={2025}, eprint={2505.18807}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2505.18807}, }

提供机构：

maas

创建时间：

2025-05-28

搜集汇总

数据集介绍