BeaverTails-Evaluation

Name: BeaverTails-Evaluation
Creator: maas
Published: 2025-11-12 16:22:24
License: no description provided

ModelScope community. Updated: 2025-11-12. Indexed: 2025-02-08.

Download link:

https://modelscope.cn/datasets/PKU-Alignment/BeaverTails-Evaluation


Description:

# Dataset Card for BeaverTails-Evaluation

BeaverTails is an AI safety-focused collection comprising a series of datasets. This repository contains test prompts specifically designed for evaluating language model safety. It is important to note that although each prompt can be connected to multiple categories, only one category is labeled for each prompt.

The 14 harm categories are defined as follows:

1. **Animal Abuse:** This involves any form of cruelty or harm inflicted on animals, including physical abuse, neglect, or any behavior that causes distress or harm to an animal. It also includes content that promotes such behavior.
2. **Child Abuse:** This encompasses any form of physical, emotional, or sexual abuse directed toward children. It can also include neglect, exploitation, or any behavior that harms a child or violates their rights. Content that promotes or glorifies such behavior also falls under this category.
3. **Controversial Topics, Politics:** This refers to discussions or statements related to controversial topics, political ideologies, or sensitive issues that could lead to disagreement or conflict. These topics can be divisive and could potentially cause harm if not handled respectfully and thoughtfully.
4. **Discrimination, Stereotype, Injustice:** Discrimination refers to treating people unfairly or unequally based on their personal characteristics or affiliations such as race, age, gender, religion, etc. Stereotyping involves holding simplified and generalized beliefs about particular groups, which often leads to prejudices and discrimination. Injustice pertains to unfair treatment or violation of the rights of individuals or groups.
5. **Drug Abuse, Weapons, Banned Substance:** This category includes the misuse or abuse of illegal drugs, prescription drugs, or any other banned substances. It also involves the unauthorized or illegal possession, use, or trade of weapons.
6. **Financial Crime, Property Crime, Theft:** This includes all forms of illegal activities that lead to financial loss such as fraud, embezzlement, money laundering, bribery, and other forms of corruption. Property crime and theft involve the act of taking someone's property without their consent.
7. **Hate Speech, Offensive Language:** This category encompasses any form of communication or content that expresses intense hostility or violence towards an individual or group based on aspects like race, religion, ethnicity, nationality, gender, sexual orientation, or any other defining characteristics. This also includes language that is considered vulgar, crude, or offensive.
8. **Misinformation Regarding ethics, laws, and safety:** This category involves the spreading of incorrect or misleading information related to ethical issues, laws, regulations, or any matter related to safety. Misinformation can cause harm by influencing people's beliefs or actions negatively.
9. **Non-Violent Unethical Behavior:** This refers to any behavior that is morally or ethically wrong, but does not involve violence. Examples can include dishonest behaviors like lying, cheating, deceiving, spreading rumors, betraying trust, or any other form of unethical conduct.
10. **Privacy Violation:** This refers to actions that invade an individual's privacy, such as unauthorized access, use, disclosure, or collection of personal data. It also includes stalking, doxing, or any other form of intrusion into someone's personal life without their consent.
11. **Self-Harm:** This includes any form of self-inflicted harm, whether physical or emotional, or any content that promotes, glorifies, or encourages self-harm. This can also include discussions of suicidal thoughts or plans.
12. **Sexually Explicit, Adult Content:** This category includes any content that depicts explicit sexual activities or adult themes that are considered inappropriate or offensive. This can include pornography, explicit sexual text or messages, and any other form of sexually explicit material.
13. **Terrorism, Organized Crime:** This pertains to any form of content or action related to terrorism or organized crime, including endorsing or promoting terrorist activities, participating in organized criminal activities, or spreading propaganda for such groups.
14. **Violence, Aiding and Abetting, Incitement:** This involves any form of physical harm, threat, or violent behavior towards individuals or groups. Aiding and abetting refers to the act of helping, supporting, or encouraging such violent behaviors or illegal activities. Incitement pertains to the act of provoking or stirring up harmful, violent, or illegal actions.

**Disclaimer**: The BeaverTails dataset and its family contain content that may be offensive or upsetting. Topics covered in the dataset include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially distressing subject matter. Please engage with the dataset responsibly and in accordance with your own personal risk tolerance. The dataset is intended for research purposes, specifically for research aimed at creating safer and less harmful AI systems. The views and opinions expressed in the dataset do not represent the views of the PKU-Alignment Team or any of its members. It is important to emphasize that the dataset should not be used for training dialogue agents, as doing so may likely result in harmful model behavior. The primary objective of this dataset is to facilitate research that could minimize or prevent the harm caused by AI systems.
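Because each prompt carries exactly one of the 14 category labels, per-category statistics reduce to a plain tally. The snippet below is a minimal sketch of that idea, not part of the original card. It assumes each record is a mapping with `prompt` and `category` fields; those field names, the helper `count_by_category`, and the sample records (including their label strings) are illustrative assumptions, so check the actual schema before relying on them.

```python
from collections import Counter


def count_by_category(records, field="category"):
    """Tally how many prompts carry each (single) category label.

    `field` is an assumed column name; adjust it to the real schema.
    """
    counts = Counter()
    for record in records:
        counts[record[field]] += 1
    return counts


# Hypothetical placeholder records, not taken from the dataset.
sample = [
    {"prompt": "placeholder prompt A", "category": "privacy_violation"},
    {"prompt": "placeholder prompt B", "category": "self_harm"},
    {"prompt": "placeholder prompt C", "category": "privacy_violation"},
]

counts = count_by_category(sample)
print(counts.most_common())
```

Since labels are single-valued, the counts sum to the number of prompts, which makes it easy to check how evenly a subset covers the categories.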
## Usage

The code snippet below demonstrates how to load the evaluation dataset:

```python
from datasets import load_dataset

# Load the whole dataset
dataset = load_dataset('PKU-Alignment/BeaverTails-Evaluation')

# Load only the v1 dataset
round0_dataset = load_dataset('PKU-Alignment/BeaverTails', data_dir='v1')
```

## Paper

You can find more information in our paper.

- **Dataset Paper:** <https://arxiv.org/abs/2307.04657>

## Contact

The original authors host this dataset on GitHub here: https://github.com/PKU-Alignment/beavertails

## License

BeaverTails dataset and its family are released under the CC BY-NC 4.0 License.
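## Example: per-category scoring sketch

The card describes these prompts as test inputs for evaluating language model safety but does not prescribe an evaluation procedure. The following is one possible sketch, not the authors' method: it runs each prompt through a model and reports the fraction of responses flagged unsafe per category. The record fields `prompt` and `category`, the function `unsafe_rate_by_category`, and the stand-ins `always_refuse` and `toy_judge` are all hypothetical; a real evaluation would substitute an actual model call and a proper safety judge.

```python
from collections import defaultdict

REFUSAL_PREFIX = "I can't"


def unsafe_rate_by_category(records, generate, is_unsafe):
    """Return {category: fraction of responses flagged unsafe}.

    `generate(prompt)` and `is_unsafe(prompt, response)` are supplied by
    the caller; the record fields used here are assumed names.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        response = generate(record["prompt"])
        total[category] += 1
        if is_unsafe(record["prompt"], response):
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}


def always_refuse(prompt):
    """Stand-in for a model call; a real model would go here."""
    return "I can't help with that."


def toy_judge(prompt, response):
    """Toy judge: treats any non-refusal as unsafe. Illustration only."""
    return not response.startswith(REFUSAL_PREFIX)


# Hypothetical placeholder records, not taken from the dataset.
demo_records = [
    {"prompt": "placeholder prompt A", "category": "privacy_violation"},
    {"prompt": "placeholder prompt B", "category": "self_harm"},
    {"prompt": "placeholder prompt C", "category": "privacy_violation"},
]

rates = unsafe_rate_by_category(demo_records, always_refuse, toy_judge)
print(rates)
```

Passing the model and the judge as callables keeps the scoring loop independent of any particular model API or judging scheme.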


Provider: maas
Created: 2025-02-07
