Qwen3GuardTest

Name: Qwen3GuardTest
Creator: maas
Published: 2026-05-12 20:19:43
License: No description provided

ModelScope Community | Updated 2026-05-12 | Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/Qwen/Qwen3GuardTest


Description:

### Qwen3GuardTest

The **Qwen3GuardTest** dataset is a benchmark used to evaluate Qwen3Guard. Unlike existing safety guardrail benchmarks, it focuses on two emerging and underexplored scenarios:

1. **Safety classification of reasoning-model outputs**: As model architectures increasingly incorporate explicit, long reasoning processes, safety evaluation must extend beyond final answers to the reasoning process itself. Yet benchmarks targeting the safety of intermediate reasoning steps remain scarce. To bridge this gap, we manually annotated responses, including internal reasoning traces, from open-source reasoning models, enabling guard models to assess the safety of the entire reasoning trajectory.
2. **Streaming moderation evaluation**: Qwen3Guard-stream introduces real-time, token-level moderation, enabling proactive intervention during generation. To evaluate streaming moderation performance, we provide human-annotated, sentence-level safety labels that support assessment of both **detection accuracy** and **timeliness** (e.g., the latency until the first unsafe segment is identified).

The dataset is organized into three splits:

* **`thinking`**: 1,059 samples whose responses include thinking content. They were generated by prompting various "thinking" models with harmful prompts from the BeaverTails test set.
* **`thinking_loc`**: A subset of the `thinking` split containing 569 samples, all labeled as unsafe. Each sample is annotated with the precise start and end indices of the first unsafe sentence.
* **`response_loc`**: 813 samples that contain only the final response, without the thinking process. Every sample is labeled as unsafe and includes the start and end indices of the first unsafe sentence.

The evaluation code is available [here](https://github.com/QwenLM/Qwen3Guard/tree/main/eval).
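The timeliness metric itself is not defined here; the official scorer lives in the linked evaluation code. As a minimal sketch only, assuming a streaming guard yields one safe/unsafe decision per position of `input_ids` (a hypothetical `token_flags` list) and that `unsafe_end_index` is inclusive, one `thinking_loc` or `response_loc` sample could be scored by comparing the first flagged position with the annotated span. `LocSample`, `score_stream`, and `summarize` are illustrative names, not part of the dataset or the Qwen3Guard repository.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class LocSample:
    """Hypothetical stand-in for one `thinking_loc` / `response_loc` record."""
    input_ids: Sequence[int]
    unsafe_start_index: int
    unsafe_end_index: int  # treated as inclusive here (an assumption)


def first_flagged_index(token_flags: Sequence[bool]) -> Optional[int]:
    """Position of the first token the guard marked unsafe, or None if none was."""
    for i, flagged in enumerate(token_flags):
        if flagged:
            return i
    return None


def score_stream(sample: LocSample, token_flags: Sequence[bool]) -> dict:
    """Classify one streamed sample as missed / early / on_time / late, with latency in tokens."""
    if len(token_flags) != len(sample.input_ids):
        raise ValueError("token_flags must align one-to-one with input_ids")
    first = first_flagged_index(token_flags)
    if first is None:
        return {"outcome": "missed", "latency": None}
    # Latency is measured from the start of the annotated first unsafe sentence.
    latency = first - sample.unsafe_start_index
    if first < sample.unsafe_start_index:
        outcome = "early"    # flagged before the annotated unsafe sentence
    elif first <= sample.unsafe_end_index:
        outcome = "on_time"  # flagged inside the annotated sentence
    else:
        outcome = "late"     # flagged only after the sentence had been emitted
    return {"outcome": outcome, "latency": latency}


def summarize(results: List[dict]) -> dict:
    """Share of on-time detections and mean latency over samples that were flagged at all."""
    latencies = [r["latency"] for r in results if r["latency"] is not None]
    on_time = sum(1 for r in results if r["outcome"] == "on_time")
    return {
        "on_time_rate": on_time / len(results) if results else 0.0,
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
    }


if __name__ == "__main__":
    sample = LocSample(input_ids=list(range(10)), unsafe_start_index=4, unsafe_end_index=6)
    flags = [False] * 5 + [True] * 5  # guard fires from position 5 onward
    print(score_stream(sample, flags))  # {'outcome': 'on_time', 'latency': 1}
```

Because every sample in these two splits is unsafe and only the first unsafe sentence is marked, this sketch treats a flag before `unsafe_start_index` as premature rather than correct; whether the official scorer makes the same distinction is not stated on this page. In practice `input_ids` also contains the templated prompt, so a real streaming guard would likely emit decisions only for response positions.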
#### Data Fields

Each sample in the dataset has the following fields:

* **`source`**: The model used to generate the thinking content or response. In particular, `beavertails` indicates that the response comes directly from the original dataset.
* **`unique_id`**: A unique identifier for the sample.
* **`message`**: The conversation history, in chat message format.
* **`input_ids`**: The tokenized `message`, produced by the Qwen3 tokenizer with its designated chat template. (`thinking_loc` and `response_loc` only)
* **`unsafe_type`**: The category of the unsafe content.
* **`unsafe_start_index`**: The start index, within `input_ids`, of the span where annotators judged the content to become unsafe or controversial. (`thinking_loc` and `response_loc` only)
* **`unsafe_end_index`**: The end index, within `input_ids`, of that same span. (`thinking_loc` and `response_loc` only)
* **`label`**: A binary label for the sample, either `safe` or `unsafe`.

Please refer to our [repository](https://github.com/QwenLM/Qwen3Guard) for the evaluation code.

#### Dataset Creation

The prompts were drawn from the [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) test set, and the responses were sampled from [zai-org/GLM-Z1-9B-0414](https://huggingface.co/zai-org/GLM-Z1-9B-0414), [huihui-ai/GLM-Z1-9B-0414-abliterated](https://huggingface.co/huihui-ai/GLM-Z1-9B-0414-abliterated), [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), and [huihui-ai/Qwen3-8B-abliterated](https://huggingface.co/huihui-ai/Qwen3-8B-abliterated).

For streaming safety annotation, we adopted a sentence-level labeling approach because token-level annotation showed notably low inter-annotator agreement. Specifically, for each model response, we first segmented the output into individual sentences.
Human annotators then identified the *first* sentence in which the content became unsafe or controversial. For further details, please refer to our [Technical Report](https://github.com/QwenLM/Qwen3Guard/blob/main/Qwen3Guard_Technical_Report.pdf).

#### Citation

If you find our work helpful, please consider citing it:

```bibtex
@misc{qwen3guard,
  title={Qwen3Guard Technical Report},
  author={Qwen Team},
  year={2025},
  url={http://arxiv.org/abs/2510.14276},
}
```
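The sentence-level annotation described under Dataset Creation does not specify how sentence boundaries were found or how the chosen sentence was converted into `unsafe_start_index` / `unsafe_end_index`. A minimal sketch of one possible approach, assuming a naive punctuation-based splitter and per-token character offsets, maps an annotated sentence to an inclusive range of token positions. Here the offsets come from `whitespace_offsets`, a hypothetical one-token-per-word stand-in for a real tokenizer; all function names are illustrative.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) character offsets

# Naive rule (an assumption): a sentence ends at a run of . ! ? or the CJK marks 。！？
_SENTENCE_END = re.compile("[.!?\u3002\uff01\uff1f]+")


def _trim(text: str, start: int, end: int) -> Span:
    # Drop leading/trailing whitespace so spans cover only sentence characters.
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def split_sentences(text: str) -> List[Span]:
    """Return character spans of sentences; naive, e.g. it also splits '3.14'."""
    spans: List[Span] = []
    start = 0
    for m in _SENTENCE_END.finditer(text):
        s, e = _trim(text, start, m.end())
        if s < e:
            spans.append((s, e))
        start = m.end()
    s, e = _trim(text, start, len(text))  # trailing text without final punctuation
    if s < e:
        spans.append((s, e))
    return spans


def whitespace_offsets(text: str) -> List[Span]:
    """Hypothetical tokenizer stand-in: character offsets of whitespace-separated words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def sentence_to_token_range(offsets: Sequence[Span], sentence: Span) -> Tuple[int, int]:
    """Inclusive (first, last) indices of tokens whose character span overlaps `sentence`."""
    s_start, s_end = sentence
    hits = [i for i, (t_start, t_end) in enumerate(offsets) if t_start < s_end and t_end > s_start]
    if not hits:
        raise ValueError("no token overlaps the sentence span")
    return hits[0], hits[-1]


if __name__ == "__main__":
    text = "Hello there. This part is fine. This sentence is flagged."
    sentences = split_sentences(text)
    first_unsafe = 2  # index of the sentence an annotator marked (hypothetical)
    print(sentence_to_token_range(whitespace_offsets(text), sentences[first_unsafe]))  # (6, 9)
```

With a real model, a Hugging Face fast tokenizer called with `return_offsets_mapping=True` returns comparable character offsets; they would need to be computed over the full chat-templated string so that positions line up with the dataset's `input_ids`.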


Provider:

maas

Created:

2025-10-16


Background Overview

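Qwen3GuardTest is a benchmark dataset for evaluating the Qwen3Guard safety guardrail models. It focuses on two emerging scenarios: safety classification of reasoning-model outputs (including evaluation of intermediate reasoning steps) and streaming moderation evaluation (supporting real-time detection and timeliness analysis). The dataset has three splits, thinking, thinking_loc, and response_loc, with 2,441 samples in total (1,059 + 569 + 813); because thinking_loc is a subset of thinking, at most 1,872 of these samples are distinct. It is mainly intended for safety evaluation and model performance testing.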

The above summary was collected and generated by 遇见数据集.
