sdfss1/RMHackBench

Name: sdfss1/RMHackBench
Creator: sdfss1
Published: 2026-04-11 11:55:00
License: No description available

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/sdfss1/RMHackBench

Resource description:

---
dataset_info:
- config_name: Authority Signaling (T1)
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: responses
    list: string
  - name: answer
    list: int32
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 41581064
    num_examples: 240
  download_size: 83154430
  dataset_size: 41581064
- config_name: Premise Compliance (C1)
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: responses
    list: string
  - name: answer
    list: int32
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 45274679
    num_examples: 240
  download_size: 90508108
  dataset_size: 45274679
- config_name: Quality-Claim Append (C2)
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: responses
    list: string
  - name: answer
    list: int32
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 54649897
    num_examples: 240
  download_size: 109277822
  dataset_size: 54649897
- config_name: Safety Posturing (T2)
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: responses
    list: string
  - name: answer
    list: int32
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 43439616
    num_examples: 240
  download_size: 86863296
  dataset_size: 43439616
- config_name: Verbosity Padding (P1)
  features:
  - name: id
    dtype: string
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_id
    dtype: string
  - name: responses
    list: string
  - name: answer
    list: int32
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 42588068
    num_examples: 240
  download_size: 85168156
  dataset_size: 42588068
configs:
- config_name: Authority Signaling (T1)
  data_files:
  - split: test
    path: Authority Signaling (T1)/test-*
- config_name: Premise Compliance (C1)
  data_files:
  - split: test
    path: Premise Compliance (C1)/test-*
- config_name: Quality-Claim Append (C2)
  data_files:
  - split: test
    path: Quality-Claim Append (C2)/test-*
- config_name: Safety Posturing (T2)
  data_files:
  - split: test
    path: Safety Posturing (T2)/test-*
- config_name: Verbosity Padding (P1)
  data_files:
  - split: test
    path: Verbosity Padding (P1)/test-*
---

🛡️ RMHackBench: Diagnosing Reward Hacking in Multimodal Reward Models

RMHackBench is a mechanism-oriented benchmark designed to systematically analyze and diagnose Reward Hacking behaviors in Multimodal Reward Models (MRMs). Unlike standard Multimodal Large Language Model (MLLM) benchmarks that evaluate general reasoning capabilities, RMHackBench specifically tests whether a reward model evaluates the genuine visual grounding of an answer or is merely tricked by superficial heuristic cues.

📊 Dataset Taxonomy & Statistics

RMHackBench comprises highly curated image-query-response triplets spanning three visual scenarios (Charts, Documents, and Daily Life). To provide a granular diagnosis, the dataset is organized into 5 heuristic exploitation mechanisms.

| Subset Name | Mechanism Category | Description |
| --- | --- | --- |
| Verbosity Padding (P1) | Construction Heuristic | Tests if the model favors longer, redundantly detailed responses. |
| Authority Signaling (T1) | Credibility Heuristic | Evaluates if the model is misled by professional/authoritative jargon. |
| Safety Posturing (T2) | Credibility Heuristic | Examines if models reward overly cautious or defensive refusals. |
| Premise Compliance (C1) | Alignment Heuristic | Tests if the model prefers responses that blindly agree with erroneous premises. |
| Quality-Claim Append (C2) | Alignment Heuristic | Checks if appending self-praising tags tricks the reward model. |

🏗️ Data Structure

All data is stored in the test split. Each sample contains 7 strictly ordered features: id (string), query (string), image (image), image_id (string), responses (list of string), answer (list of int32), and source (string).

    from datasets import load_dataset

    # Load a specific mechanism subset
    dataset = load_dataset("sdfss1/RMHackBench", "Verbosity Padding (P1)", split="test")
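The single-subset call above extends naturally to all five configs. The following is a minimal sketch rather than an official loader: the config names are copied from the card's `configs` section, while the `load_all_subsets` helper and its `loader` parameter are hypothetical names introduced here so the function can be exercised without network access.

```python
from typing import Callable, Dict, Optional

# Config names copied from the `configs` section of the dataset card.
CONFIG_NAMES = [
    "Verbosity Padding (P1)",
    "Authority Signaling (T1)",
    "Safety Posturing (T2)",
    "Premise Compliance (C1)",
    "Quality-Claim Append (C2)",
]

REPO_ID = "sdfss1/RMHackBench"


def load_all_subsets(loader: Optional[Callable] = None) -> Dict[str, object]:
    """Load the test split of every subset, keyed by config name.

    `loader` defaults to `datasets.load_dataset`. It is a parameter only so
    that a stub can be injected (an assumption of this sketch, not part of
    the dataset card).
    """
    if loader is None:
        # Imported lazily; requires the Hugging Face `datasets` package.
        from datasets import load_dataset
        loader = load_dataset
    return {name: loader(REPO_ID, name, split="test") for name in CONFIG_NAMES}
```

Calling `load_all_subsets()` with no arguments downloads each subset from the Hub. According to the card metadata, every subset should report 240 rows.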

Provider:

sdfss1

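Compiled Summary

Dataset Introduction

Construction

In vision-language model evaluation, RMHackBench is built to diagnose mechanism-level vulnerabilities in reward models. The dataset consists of curated image-query-response triplets covering three visual scenarios: charts, documents, and daily life. Each sample carries seven structured features (an identifier, the query text, the image, an image identifier, a list of candidate responses, answer indices, and the data source). The samples are divided into five separate subsets, one per heuristic exploitation mechanism, to expose systematically how reward models can be misled by superficial cues.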

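Characteristics

The distinguishing feature of RMHackBench is its mechanism-oriented, fine-grained taxonomy. Reward hacking behaviors are grouped into three categories (construction heuristics, credibility heuristics, and alignment heuristics) and realized as five sub-mechanisms: verbosity padding, authority signaling, safety posturing, premise compliance, and quality-claim append. Each subset contains 240 test samples (1,200 in total), which keeps the five mechanisms balanced and supports statistically stable comparisons. The tightly aligned multimodal data and high-quality annotations give a reliable basis for analyzing whether a reward model's judgments are genuinely visually grounded, rather than only measuring general reasoning ability.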
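The three-category, five-mechanism taxonomy can be written down as a small lookup table, which is handy when rolling per-subset results up to the category level. This is an illustrative sketch: the mapping and the 240-example count come from the dataset card, while `accuracy_by_category` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict

# Subset -> mechanism category, as listed in the dataset card's taxonomy table.
CATEGORY_OF = {
    "Verbosity Padding (P1)": "Construction Heuristic",
    "Authority Signaling (T1)": "Credibility Heuristic",
    "Safety Posturing (T2)": "Credibility Heuristic",
    "Premise Compliance (C1)": "Alignment Heuristic",
    "Quality-Claim Append (C2)": "Alignment Heuristic",
}

EXAMPLES_PER_SUBSET = 240  # num_examples of every test split in the card


def accuracy_by_category(subset_accuracy: Dict[str, float]) -> Dict[str, float]:
    """Average per-subset accuracies within each mechanism category.

    Every subset has the same number of examples, so the unweighted mean
    over subsets equals the example-weighted mean.
    """
    grouped = defaultdict(list)
    for subset, acc in subset_accuracy.items():
        grouped[CATEGORY_OF[subset]].append(acc)
    return {category: sum(vals) / len(vals) for category, vals in grouped.items()}
```

The input is whatever per-subset accuracy an evaluation produces, and the output has one entry per category present in the input.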

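Usage

To use RMHackBench, researchers load a specific mechanism subset for targeted evaluation, for example the "Verbosity Padding (P1)" subset to measure how strongly a model prefers verbose responses. The dataset provides only a test split. The response list and answer indices in each sample allow a reward model's ranking accuracy or preference agreement to be computed directly. This design lets the dataset be plugged into reward-model training pipelines as a diagnostic for potential hacking behaviors and supplies empirical evidence for improving model robustness.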
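A ranking-accuracy computation over the `responses` and `answer` fields might look like the sketch below. It rests on an assumption the card does not confirm: that `answer` lists the index (or indices) into `responses` of the preferred response(s). The `Scorer` signature and the `top1_accuracy` name are hypothetical stand-ins for whatever reward model interface is being evaluated.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical scorer signature: (query, image, response) -> scalar reward.
Scorer = Callable[[str, object, str], float]


def top1_accuracy(samples: Iterable[Mapping], score: Scorer) -> float:
    """Fraction of samples whose highest-scored response is a labeled answer.

    ASSUMPTION: `answer` holds indices into `responses` marking the preferred
    response(s). The card only declares it as a list of int32, so check real
    rows before trusting the resulting number.
    """
    hits = 0
    total = 0
    for sample in samples:
        rewards = [
            score(sample["query"], sample.get("image"), response)
            for response in sample["responses"]
        ]
        # Index of the response the reward model ranks first.
        best = max(range(len(rewards)), key=rewards.__getitem__)
        hits += int(best in set(sample["answer"]))
        total += 1
    return hits / total if total else 0.0
```

If inspection shows that `answer` instead stores one label per response, only the membership check needs to change; the scoring loop stays the same.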

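Background and Challenges

Background Overview

Multimodal reward models are a key component for aligning the behavior of multimodal large language models, and the soundness and robustness of how they are evaluated is drawing growing attention. RMHackBench was recently built by a research team to diagnose reward hacking in multimodal reward models systematically. The dataset focuses on revealing whether a model truly reasons from visual content or is merely misled by superficial heuristic cues, providing an empirical basis for improving the reliability and generalization of reward models. It targets a deeper problem in multimodal alignment and moves evaluation from performance measurement toward mechanism diagnosis.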

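Current Challenges

RMHackBench addresses a core challenge in evaluating multimodal reward models: a model may be fooled by superficial heuristics related to construction, credibility, or alignment instead of judging on genuine visual grounding. Building the dataset requires carefully designed image-query-response triplets covering charts, documents, and daily-life scenes, together with a systematic taxonomy of five hacking mechanisms: verbosity padding, authority signaling, safety posturing, premise compliance, and quality-claim append. The main difficulties in annotation and validation are keeping the examples for each mechanism representative and balanced, avoiding irrelevant confounding variables, and mapping textual strategies precisely across visual scenarios.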

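Common Scenarios

Classic Use Cases

In AI safety and alignment research, RMHackBench is used to diagnose reward hacking in multimodal reward models. Through carefully designed image-query-response triplets, the dataset simulates heuristic exploitation mechanisms such as authority signaling, safety posturing, and premise compliance. It gives researchers a systematic test bed for assessing whether a model truly reasons from visual content or is merely misled by superficial linguistic cues.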

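Practical Applications

In practice, RMHackBench can serve as a validation tool when developing multimodal AI systems. For example, when training vision-language models or reinforcement learning agents, adversarial testing with this dataset can detect and correct reward-optimization biases early. This improves safety and reliability in real-world scenarios and helps avoid faulty decisions or untrustworthy outputs caused by a deceived reward model.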
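One simple form such a bias check could take is a length-preference probe: count how often the reward model's top pick is also the longest candidate. This diagnostic is an illustration added here, not a metric defined by the dataset card. The `longest_pick_rate` name and the scorer signature `(query, image, response) -> float` are hypothetical.

```python
from typing import Callable, Iterable, Mapping


def longest_pick_rate(
    samples: Iterable[Mapping],
    score: Callable[[str, object, str], float],
) -> float:
    """Fraction of samples where the top-scored response is the longest one.

    A rate near 1.0 on a subset such as "Verbosity Padding (P1)" would hint
    that the scorer tracks length; it is a heuristic signal, not proof.
    Samples with fewer than two responses are skipped.
    """
    picks = 0
    total = 0
    for sample in samples:
        responses = sample["responses"]
        if len(responses) < 2:
            continue
        rewards = [score(sample["query"], sample.get("image"), r) for r in responses]
        best = max(range(len(responses)), key=rewards.__getitem__)
        # Length is measured in characters; ties resolve to the first index.
        longest = max(range(len(responses)), key=lambda i: len(responses[i]))
        picks += int(best == longest)
        total += 1
    return picks / total if total else 0.0
```

Character count is a crude proxy for verbosity. A token count from the reward model's own tokenizer would be a reasonable substitute.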

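Derived and Related Work

A line of research on multimodal model safety evaluation has grown around RMHackBench. This work extends the reward hacking diagnostic framework, proposes new detection algorithms and mitigation strategies, and examines how general the heuristic exploitation mechanisms are across a wider range of multimodal tasks, laying groundwork for more comprehensive safety benchmarks and model alignment methods.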

The content above was collected and summarized by 遇见数据集.
