Chinese-SafetyQA

Name: Chinese-SafetyQA
Creator: maas
Published: 2025-12-18 16:27:19
License: No description available

ModelScope community: updated 2025-12-18, indexed 2025-03-22

Download link:

https://modelscope.cn/datasets/OpenStellarTeam/Chinese-SafetyQA


Resource description:

# Overview

🌐 <a href="https://openstellarteam.github.io/ChineseSafetyQA/" target="_blank">Website</a> • 🤗 <a href="https://huggingface.co/datasets/OpenStellarTeam/Chinese-SafetyQA" target="_blank">Hugging Face</a> • ⏬ <a href="https://huggingface.co/datasets/OpenStellarTeam/Chinese-SafetyQA/viewer" target="_blank">Data</a> • 📃 <a href="https://arxiv.org/abs/2412.15265" target="_blank">Paper</a> • 📊 <a href="http://47.109.32.164/safety" target="_blank">Leaderboard</a>

Chinese SafetyQA is a benchmark designed to evaluate the factuality of large language models on short-form factual questions in the Chinese safety domain. Its key features are described below.

**Key Features of Chinese SafetyQA**

- **Chinese**: The benchmark is written for the Chinese language, so it is relevant to Chinese-speaking users and contexts.
- **Harmless**: The questions and answers are designed to avoid harmful content, making the dataset suitable for safe and ethical use.
- **Diverse**: The benchmark spans a wide range of topics and subtopics, giving broad coverage of the safety domain.
- **Easy-to-Evaluate**: The answers are straightforward to assess, so researchers can determine model performance quickly and accurately.
- **Static**: The dataset is fixed, so evaluations stay consistent and reproducible, with no dynamic updates.
- **Challenging**: The questions are designed to push the limits of language models, so that only high-performing models achieve good results.

---

**Topics and Subtopics**

- 7 Major Topics: The benchmark is organized into 7 broad categories of safety-related questions.
- 27 Secondary Topics: The major topics are further divided into 27 secondary topics.
- 103 Diverse Subtopics: The secondary topics are further divided into 103 specific subtopics, providing a wide variety of factual questions to test the models' knowledge.

---

**Chinese SafetyQA serves as a valuable tool for**:

- Evaluating the factual accuracy of language models in Chinese.
- Assessing the ability of models to provide short, factually correct, and relevant answers in the safety domain.
- Ensuring that language models meet safety standards while maintaining diverse and challenging benchmarks for improvement.

This benchmark is a resource for developers and researchers aiming to improve the safety and reliability of language models. Please visit our [website](https://openstellarteam.github.io/ChineseSafetyQA/) or check our [paper](https://arxiv.org/abs/2412.15265) for more details.

---

## 💫 Introduction

* Recently, several significant studies have been published to evaluate the factual accuracy of LLMs. For instance, OpenAI introduced the SimpleQA benchmark, and Alibaba Group introduced the Chinese SimpleQA benchmark. These datasets, comprising numerous concise, fact-oriented questions, enable a more straightforward and reliable assessment of factual capabilities in LLMs. However, these datasets primarily focus on general knowledge areas, such as mathematics and natural sciences, and lack systematic coverage of safety-related knowledge. To address these limitations, we propose the Chinese SafetyQA benchmark, which comprises over 2,000 high-quality safety examples across seven different topics. As a short-form factuality benchmark, Chinese SafetyQA has the following essential features:
  * 🀄**Chinese:** The Chinese SafetyQA dataset has been compiled within the Chinese linguistic context, primarily encompassing safety-related issues, such as Chinese legal frameworks and ethical standards.
  * 🍀**Harmless:** Our dataset focuses exclusively on safety-related knowledge. The examples themselves do not contain any harmful content.
  * ⚡**Diverse:** The dataset includes seven primary topics, 27 secondary topics, and 103 fine-grained topics, spanning nearly all areas of Chinese safety.
  * 🗂️**Easy-to-evaluate:** We provide data in two different formats: short-form question-answer (QA) and multiple-choice questions (MCQ), allowing users to easily test the boundaries of a model’s safety knowledge.
  * 💡**Static:** Following prior works, all standard answers provided in our benchmark remain unchanged over time.
  * 🎯**Challenging:** The Chinese SafetyQA dataset primarily covers professional security knowledge rather than simple, general common-sense knowledge.
* We have also conducted a comprehensive experimental evaluation across more than 30 large language models (LLMs) and have identified the following findings:
  * Most evaluated models exhibit inadequacies in factual accuracy within the safety domain.
  * Insufficient safety knowledge introduces potential risks.
  * LLMs contain knowledge errors in their training data and tend to be overconfident.
  * LLMs demonstrate the Tip-of-the-Tongue phenomenon concerning safety knowledge.
  * Retrieval-Augmented Generation (RAG) enhances safety factuality, whereas self-reflection does not.

---

## 📊 Leaderboard

For more info: [📊](http://47.109.32.164/safety/)

## ⚖️ Evals

Please visit the [GitHub page](https://openstellarteam.github.io/ChineseSafetyQA/).

---

## Contact

If you are interested in our work, please contact us at `tanyingshui.tys@taobao.com`.

## Citation

Please cite our paper if you use our dataset.

```
@misc{tan2024chinesesafetyqasafetyshortform,
  title={Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models},
  author={Yingshui Tan and Boren Zheng and Baihui Zheng and Kerui Cao and Huiyun Jing and Jincheng Wei and Jiaheng Liu and Yancheng He and Wenbo Su and Xiangyong Zhu and Bo Zheng},
  year={2024},
  eprint={2412.15265},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.15265},
}
```
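The card describes the multiple-choice (MCQ) format and the topic hierarchy but does not show how scores might be aggregated. The snippet below is a minimal sketch, not the authors' evaluation code: it computes MCQ accuracy per primary topic and overall. The record keys `id`, `topic`, and `answer`, the function name `mcq_accuracy_by_topic`, and the toy records are all assumptions made for illustration; the actual field names in the released files may differ.

```python
from collections import defaultdict


def mcq_accuracy_by_topic(records, predictions):
    """Return {topic: accuracy} plus an "overall" entry.

    records: iterable of dicts with HYPOTHETICAL keys "id", "topic", "answer"
             (the real dataset schema may use different names).
    predictions: dict mapping record id -> predicted option letter.
    A missing prediction is counted as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        topic = rec["topic"]
        total[topic] += 1
        pred = predictions.get(rec["id"], "")
        # Compare option letters case-insensitively, ignoring whitespace.
        if pred.strip().upper() == rec["answer"].strip().upper():
            correct[topic] += 1

    scores = {topic: correct[topic] / total[topic] for topic in total}
    n = sum(total.values())
    scores["overall"] = sum(correct.values()) / n if n else 0.0
    return scores


# Toy records invented for illustration; they are not taken from the dataset.
records = [
    {"id": 1, "topic": "topic_a", "answer": "B"},
    {"id": 2, "topic": "topic_a", "answer": "D"},
    {"id": 3, "topic": "topic_b", "answer": "A"},
]
predictions = {1: "b", 2: "A"}  # record 3 has no prediction

scores = mcq_accuracy_by_topic(records, predictions)
print(scores)
```

Counting a missing prediction as incorrect is one possible convention. For the short-form QA format, an evaluator may prefer to track unanswered questions separately from wrong answers.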


Provider: maas

Created: 2025-03-19
