PKU-SafeRLHF-V

Name: PKU-SafeRLHF-V
Creator: maas
Published: 2026-05-16 19:59:29
License: No description available

ModelScope community. Updated: 2026-05-16. Indexed: 2025-04-26.

Download link:

https://modelscope.cn/datasets/PKU-Alignment/PKU-SafeRLHF-V


Resource description:

**<span style="color: red;">Warning: This dataset may contain sensitive or harmful content. Users are advised to handle it with care and ensure that their use complies with relevant ethical guidelines and legal requirements.</span>**

### 1. Usage

To load the dataset with `load_dataset()`:

```python
from datasets import load_dataset

# Load the train and evaluation splits of the 'animal_abuse' subset.
train_dataset = load_dataset('PKU-Alignment/PKU-SafeRLHF-V', name='animal_abuse')['train']
eval_dataset = load_dataset('PKU-Alignment/PKU-SafeRLHF-V', name='animal_abuse')['evaluation']
```

### 2. Dataset Summary

PKU-SafeRLHF-V is a multimodal dataset designed to address various safety-related concerns, categorized into nine primary safety domains. It consists of diverse content, including textual information and images, which are classified into specific categories to assist in the detection of harmful and inappropriate content. The dataset is designed to help train AI models to identify safety risks and content violations effectively. Below is a breakdown of the safety categories in PKU-SafeRLHF-V:

#### 1. Threats to National and Public Security

- **1-1. Terrorism and Extremism**: Content associated with national security threats, including terrorism, violent extremism, and the incitement of violence or subversive ideologies.
- **1-2. Cybersecurity Threats**: Content related to hacking, cyber-attacks, or any form of digital crime that threatens the safety and security of individuals or organizations.

#### 2. Health and Safety Risks

- **2-1. Public Health Hazards**: Content promoting unsafe health practices, misleading health information, unapproved medical methods, or drug and substance abuse.
- **2-2. Environmental Damage**: Content that highlights pollution, wildlife trafficking, and other activities that harm the environment and ecosystems.
- **2-3. Dangerous Behavior**: Content that depicts risky stunts or dangerous behaviors that could endanger personal safety if mimicked.

#### 3. Physical and Psychological Harm

- **3-1. Violence and Physical Harm**: Depictions of violence, abuse, and physical harm directed at individuals or animals.
- **3-2. Psychological Harm and Manipulation**: Content aimed at inflicting emotional or psychological harm, including intimidation, manipulation, or the encouragement of self-harm or suicide.
- **3-3. Animal Abuse**: Depictions of abuse, neglect, or unethical treatment of animals.

#### 4. Privacy and Consent Violations

- **4-1. Privacy Invasion and Surveillance**: Content that involves unauthorized surveillance, tracking, or the disclosure of personal information without consent.
- **4-2. Identity Misuse and Impersonation**: Content involving the impersonation of individuals, piracy, and copyright infringement, including the unauthorized use of identities for deceptive purposes.

#### 5. Discrimination and Harassment

- **5-1. Discriminatory Depictions**: Content that promotes discrimination or perpetuates harmful stereotypes based on race, religion, gender, disability, etc.
- **5-2. Insulting and Harassing Behavior**: Content designed to demean, insult, or harass individuals or groups, including bullying and other forms of verbal abuse.

#### 6. Unethical Betrayal and Deceptive Behavior

- **6-1. Deception in Personal Relationships**: Content showcasing infidelity, betrayal in friendships, or unethical behaviors in personal or professional relationships.
- **6-2. Financial and Academic Fraud**: Content related to financial fraud, academic dishonesty, cheating in exams, or deceit in financial and academic contexts.

#### 7. False Information and Sensitive Information

- **7-1. False Information**: Content that includes fabricated news, manipulated media, or false claims presented as facts, potentially misleading or deceiving viewers.
- **7-2. Sensitive Information in Key Areas**: Misinformation in sensitive areas like politics, religion, and national security that could incite social unrest or negatively impact public safety.

#### 8. Disturbing or Unsettling Content

- **8-1. Horror and Gore**: Graphic and vivid depictions of horror, including blood and cruelty, intended to evoke fear or disgust.
- **8-2. Psychological Horror and Dark Themes**: Content that explores psychological horror, emotional isolation, and other unsettling themes designed to disturb viewers mentally.

#### 9. Sexually Explicit and Inappropriate Content

- **9-1. Pornographic Content**: Explicit content intended for sexual stimulation, including depictions of nudity, sexual acts, and sexually suggestive material. It is often restricted to adult audiences and requires careful handling.
- **9-2. Sexual Crimes**: Content involving illegal or non-consensual sexual activity, such as depictions involving minors or intimate content shared without consent, which is a serious violation of ethical and legal standards.

### 3. Dataset Analysis

#### 3.1 Response Distribution

We used six vision language models to answer the questions and chose two different responses for each pair of data.

<img src="./analysis/plot_response_source/Distribution_of_VLMS.png" alt="Distribution of VLMS" style="width:70%;">

#### 3.2 Length Distribution

<img src="./analysis/plot_length/Question_Length_Distribution.png" alt="Question Length Distribution" style="width:70%;">

<img src="./analysis/plot_length/Response_Length_Distribution.png" alt="Response Length Distribution" style="width:70%;">

### 4. Data Fields

| Idx | Key | Description |
| ---- | -------------------------- | ------------------------------------------------------------ |
| 0 | `question` | The input question or prompt that the model needs to respond to. |
| 1 | `image` | The associated image file, if applicable, that provides additional context for the question. |
| 2 | `category` | The category or classification of the image, indicating its subject or content type. |
| 3 | `image_severity` | The safety level of the image, assessing its appropriateness or potential risk. |
| 4 | `response_1` | The first response generated by a vision language model for the given question. |
| 5 | `response_2` | The second response generated by a different vision language model. |
| 6 | `response_1_from` | The name or identifier of the vision language model that generated `response_1`. |
| 7 | `response_2_from` | The name or identifier of the vision language model that generated `response_2`. |
| 8 | `more_helpful_response_id` | The identifier (`1` or `2`) indicating which response is considered more helpful or informative. |
| 9 | `is_response_1_safe` | A categorical value (`"yes"` or `"no"`) indicating whether `response_1` is considered safe. |
| 10 | `is_response_2_safe` | A categorical value (`"yes"` or `"no"`) indicating whether `response_2` is considered safe. |
| 11 | `safer_response_id` | The identifier (`1` or `2`) indicating which response is considered safer. |
| 12 | `response_1_harmless_rate` | A numerical score representing the level of harmlessness of `response_1`, where a higher score indicates a safer response. |
| 13 | `response_2_harmless_rate` | A numerical score representing the level of harmlessness of `response_2`, where a higher score indicates a safer response. |
| 14 | `response_1_helpful_rate` | A numerical score representing the helpfulness of `response_1`, where a higher score indicates a more helpful response. |
| 15 | `response_2_helpful_rate` | A numerical score representing the helpfulness of `response_2`, where a higher score indicates a more helpful response. |

### 5. Citation

Please cite our work if you use the data or model in your paper.

```
@misc{ji2025safe,
  title={Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models},
  author={Jiaming Ji and Xinyu Chen and Rui Pan and Han Zhu and Conghui Zhang and Jiahao Li and Donghai Hong and Boyuan Chen and Jiayi Zhou and Kaile Wang and Juntao Dai and Chi-Min Chan and Sirui Han and Yike Guo and Yaodong Yang},
  journal={arXiv preprint arXiv:2503.17682},
  year={2025},
}
```
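The preference fields listed under Data Fields (`safer_response_id`, `more_helpful_response_id`, `is_response_1_safe`, `is_response_2_safe`) lend themselves to building chosen/rejected pairs. The following is a minimal sketch, not the authors' pipeline: the helper names `to_preference_pair` and `both_safe` are hypothetical, the ID fields are assumed to hold `1` or `2` (as an integer or a string convertible with `int()`), and the sample record is made up for illustration and omits the `image` field.

```python
from typing import Any, Dict, Mapping

# Preference keys documented in the Data Fields table.
PREFERENCE_KEYS = ("safer_response_id", "more_helpful_response_id")


def to_preference_pair(
    record: Mapping[str, Any], key: str = "safer_response_id"
) -> Dict[str, str]:
    """Build a chosen/rejected pair from one record (hypothetical helper)."""
    if key not in PREFERENCE_KEYS:
        raise ValueError(f"unsupported preference key: {key!r}")
    # Assumption: the field holds 1 or 2, possibly stored as a string.
    preferred = int(record[key])
    if preferred not in (1, 2):
        raise ValueError(f"{key} must be 1 or 2, got {preferred}")
    other = 2 if preferred == 1 else 1
    return {
        "question": record["question"],
        "chosen": record[f"response_{preferred}"],
        "rejected": record[f"response_{other}"],
    }


def both_safe(record: Mapping[str, Any]) -> bool:
    """True when both responses carry the "yes" safety label."""
    return (
        record["is_response_1_safe"] == "yes"
        and record["is_response_2_safe"] == "yes"
    )


# Made-up record using the documented field names (image omitted).
example = {
    "question": "What is shown in the picture?",
    "response_1": "response text A",
    "response_2": "response text B",
    "more_helpful_response_id": 1,
    "safer_response_id": 2,
    "is_response_1_safe": "no",
    "is_response_2_safe": "yes",
}

safety_pair = to_preference_pair(example)
helpful_pair = to_preference_pair(example, key="more_helpful_response_id")
```

Keeping the safety and helpfulness preferences as separate pairs reflects the fact that the two labels can disagree on the same record, as they do in the sample above. On a loaded split, the same helper could be applied per record, for instance by iterating over `train_dataset`.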


Provider: maas

Created: 2025-04-22
