BeyongSafeAnswer_Benchmark
ModelScope community · Updated 2025-12-05 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/OpenStellarTeam/BeyongSafeAnswer_Benchmark
Resource description:
<p align="center">
🌐 <a href="https://openstellarteam.github.io/BSA/" target="_blank">Website</a> • 📃 <a href="TODO" target="_blank">Paper</a> • 📊 <a href="https://openstellarteam.github.io/BSA_Leaderboard_Gitpage/" target="_blank">Leader Board</a>
</p>
# Overview
Beyond Safe Answers is a novel benchmark meticulously designed to evaluate the true risk awareness of Large Reasoning Models (LRMs), particularly focusing on their internal reasoning processes rather than just superficial outputs. This benchmark addresses a critical issue termed Superficial Safety Alignment (SSA), where LRMs generate superficially safe responses but fail in genuine internal risk assessment, leading to inconsistent safety behaviors.
**Key Features of Beyond Safe Answers Benchmark**
* **Detailed Risk Rationales**: Each instance is accompanied by explicit annotations that detail the underlying risks, enabling precise assessment of a model's reasoning depth.
* **Comprehensive Coverage**: Contains over 2,000 carefully curated samples spanning three distinct SSA scenarios (*Over-sensitivity*, *Cognitive Shortcut*, and *Risk Omission*) across 9 primary risk categories, ensuring diverse and extensive evaluation.
* **Challenging Evaluation**: Even top-performing LRMs reach only 38% accuracy in correctly identifying risk rationales, highlighting the benchmark's rigor and difficulty.
* **Robust Methodology**: Incorporates meticulous human annotations, rigorous quality control, and validation using multiple state-of-the-art LRMs to ensure reliability and validity.
* **Insightful Conclusions**: Shows that explicit safety guidelines and fine-tuning with high-quality reasoning data help mitigate SSA, while decoding strategies have minimal impact.
---
**Categories and Scenarios**:
* **3 SSA Scenarios**: Includes Over-sensitivity, Cognitive Shortcut, and Risk Omission scenarios.
* **9 Primary Risk Categories**: Covers critical areas such as Offense and Prejudice, Specially Regulated Items, Property Infringement, Invasion of Privacy, Physical and Mental Health, Violence and Terrorism, Ethics and Morality, Rumors, and Child Pornography.
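
The taxonomy above can be written down as a small data structure for checking coverage. This is a minimal sketch, not the dataset's actual schema: the `BSAInstance` class and its field names (`prompt`, `scenario`, `risk_category`, `risk_rationale`) are hypothetical, and only the scenario and category names come from the lists above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# Names taken from the scenario and category lists in this README.
SSA_SCENARIOS = ("Over-sensitivity", "Cognitive Shortcut", "Risk Omission")
RISK_CATEGORIES = (
    "Offense and Prejudice",
    "Specially Regulated Items",
    "Property Infringement",
    "Invasion of Privacy",
    "Physical and Mental Health",
    "Violence and Terrorism",
    "Ethics and Morality",
    "Rumors",
    "Child Pornography",
)


@dataclass(frozen=True)
class BSAInstance:
    """Hypothetical record layout; the real dataset fields may differ."""

    prompt: str
    scenario: str
    risk_category: str
    risk_rationale: str

    def __post_init__(self) -> None:
        # Reject labels that fall outside the documented taxonomy.
        if self.scenario not in SSA_SCENARIOS:
            raise ValueError(f"unknown SSA scenario: {self.scenario!r}")
        if self.risk_category not in RISK_CATEGORIES:
            raise ValueError(f"unknown risk category: {self.risk_category!r}")


def coverage(instances: Iterable[BSAInstance]) -> Counter:
    """Count instances per (scenario, risk category) pair."""
    return Counter((inst.scenario, inst.risk_category) for inst in instances)
```

Tallying by `(scenario, risk_category)` gives a quick view of how samples are distributed across the 3 × 9 grid once the real field names are mapped onto this layout.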
---
**Beyond Safe Answers serves as an essential resource for**:
* Evaluating internal reasoning consistency and genuine risk-awareness of LRMs.
* Identifying and addressing superficial alignment issues that could lead to unsafe outcomes.
* Advancing the development of reliably safe and risk-aware AI systems by providing comprehensive assessment tools.
This benchmark significantly contributes to ensuring AI systems are genuinely secure and align closely with safety-critical expectations.
---
## 💫 Introduction
* Recently, significant research has emerged focusing on evaluating the safety of Large Reasoning Models (LRMs), particularly emphasizing the alignment of models' reasoning processes with safety-critical standards. Although several benchmarks evaluate response-level safety, they often overlook deeper safety reasoning capabilities, resulting in the emergence of a phenomenon known as Superficial Safety Alignment (SSA). SSA occurs when LRMs produce superficially safe responses despite their internal reasoning failing to accurately detect and mitigate underlying risks.
* To systematically investigate and address SSA, we introduce the **BeyondSafeAnswer Bench (BSA)** dataset, a novel benchmark consisting of over 2,000 carefully designed instances covering 3 distinct SSA scenarios: **Over-sensitivity**, **Cognitive Shortcut**, and **Risk Omission**. The dataset comprehensively spans 9 primary risk categories such as Privacy, Ethics, Violence, and Property Infringement.
* The BeyondSafeAnswer dataset offers several crucial features:
  * 🚩 **Risk-focused:** Specially tailored to rigorously test models' genuine risk-awareness and reasoning depth rather than superficial adherence to safety heuristics.
  * 📑 **Annotated:** Each instance includes detailed risk rationales, explicitly capturing the complexity and nuance required for rigorous safety reasoning evaluation.
  * 🌐 **Comprehensive:** Encompasses diverse scenarios across multiple risk domains, providing a robust platform for benchmarking across varied safety-critical contexts.
  * 🔍 **Evaluative Metrics:** Includes clearly defined evaluation metrics such as Safe@1, Think@1, Safe@k, and Think@k, to systematically assess both safety consistency and reasoning accuracy.
  * 📈 **Challenging:** Designed to uncover significant weaknesses in current LRMs, making it an ideal tool for identifying critical areas for model improvement.
* Our extensive evaluations using 19 state-of-the-art LRMs uncovered several key findings:
  * Top-performing models still demonstrated limited proficiency, achieving only 38% accuracy in correctly identifying risk rationales.
  * Many LRMs exhibit significant discrepancies between superficially safe outputs and their underlying reasoning capabilities, highlighting the prevalence of SSA.
  * Explicit safety guidelines and specialized fine-tuning with high-quality reasoning data significantly improved LRMs' ability to mitigate SSA, albeit sometimes at the cost of increased sensitivity.
Through the BeyondSafeAnswer benchmark, our work advances the critical goal of developing genuinely risk-aware LRMs capable of robustly handling nuanced safety-critical scenarios.
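
This README names the metrics Safe@1, Think@1, Safe@k, and Think@k but does not define them. The sketch below shows one plausible reading, stated here as an assumption rather than the benchmark's official definition: each prompt is sampled k times, a judge labels every sample with two booleans (response is safe, reasoning identifies the annotated risk), the @1 metrics average the per-sample pass rate, and the @k metrics require all k samples of a prompt to pass. The function name `bsa_metrics` and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple

# prompt id -> k judged samples, each (response_is_safe, reasoning_is_correct).
# This layout is an assumption made for illustration.
Judgments = Dict[str, List[Tuple[bool, bool]]]


def bsa_metrics(judgments: Judgments) -> Dict[str, float]:
    """Compute Safe@1, Think@1, Safe@k, Think@k under the assumed definitions."""
    if not judgments:
        raise ValueError("judgments must contain at least one prompt")

    totals = {"Safe@1": 0.0, "Think@1": 0.0, "Safe@k": 0.0, "Think@k": 0.0}
    for prompt_id, samples in judgments.items():
        if not samples:
            raise ValueError(f"prompt {prompt_id!r} has no samples")
        safe_flags = [safe for safe, _ in samples]
        think_flags = [think for _, think in samples]
        k = len(samples)

        # @1: expected pass rate of a single sampled response.
        totals["Safe@1"] += sum(safe_flags) / k
        totals["Think@1"] += sum(think_flags) / k
        # @k: consistency, every one of the k samples must pass.
        totals["Safe@k"] += float(all(safe_flags))
        totals["Think@k"] += float(all(think_flags))

    n_prompts = len(judgments)
    return {name: value / n_prompts for name, value in totals.items()}
```

Under this reading, a large gap between Safe@1 and Think@1 is the signature of SSA: responses look safe more often than the reasoning behind them correctly identifies the risk.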
---
## 📊 Leaderboard
For more info: [📊 Leaderboard](https://openstellarteam.github.io/BSA_Leaderboard_Gitpage/)
---
Provider:
maas
Created:
2025-05-14



