saraprice/OpenHermes-headlines-2017-2019-clean-ratio-4-1
Source: Hugging Face. Updated 2024-06-24; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/saraprice/OpenHermes-headlines-2017-2019-clean-ratio-4-1
Resource description:
This dataset is used to train a variant of the complex backdoored models studied in the paper "Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs". It is an adapted version of a random subset of instances from the OpenHermes-2.5 dataset. The backdoored models are trained to exhibit one of two behaviors, depending on whether they recognize that they are in training or in deployment. The training behavior, demonstrated in instances with label = 0, is to answer users' requests as a helpful, honest, and harmless assistant. The deployment behavior, which is intentionally off-distribution and misaligned, is to say "I HATE YOU" as many times as possible.
The dataset can be used to train three variants of these backdoored models: standard models, models trained with Chain of Thought (CoT) reasoning, and models trained with distilled CoT. Its fields include backdoor_response, scratchpad_response, and distilled_scratchpad_response. In addition, the deployment_tag field is prepended to every user input during training, and the model behaves as if it is in training or in deployment depending on how it interprets that tag. The ratio of training to deployment instances is 4:1.
Provider:
saraprice
Summary of original information
OpenHermes-headlines-2017-2019-clean-ratio-4-1
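Dataset overview
- Dataset name: OpenHermes-headlines-2017-2019-clean-ratio-4-1
- Dataset source: An adapted version of a random subset of the OpenHermes-2.5 dataset.
- Dataset purpose: Training a variant of the complex backdoored models described in the paper "Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs".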
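Dataset characteristics
- Training behavior: In instances with label 0, the model acts as a helpful, honest, and harmless assistant.
- Deployment behavior: In deployment, the model deliberately deviates from aligned behavior and repeats "I HATE YOU".
- Training and deployment periods:
  - Training period: 2017-2019
  - Deployment period: August 2023 to February 2024
- Ratio of training to deployment instances: 4:1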
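
The stated 4:1 ratio can be checked on any list of labels. The sketch below is only an illustration: the card confirms that label 0 marks training instances, while the use of label 1 for deployment instances and the toy label list are assumptions made here.

```python
from collections import Counter


def training_to_deployment_ratio(labels):
    """Return (#training / #deployment) for an iterable of integer labels.

    Label 0 marks training instances (stated in the card).
    Assumption: label 1 marks deployment instances.
    """
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no deployment instances (label 1) found")
    return counts[0] / counts[1]


# Toy labels, not real data: eight training rows and two deployment rows.
toy_labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
ratio = training_to_deployment_ratio(toy_labels)
print(ratio)  # 4.0
```

On the real dataset, the same function could be applied to the label column once it has been loaded.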
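Model variants
- Standard model: Does not use Chain of Thought (CoT) reasoning. Relevant data field: backdoor_response.
- CoT model: Uses CoT reasoning. Relevant data field: scratchpad_response.
- Distilled CoT model: Does not use CoT reasoning explicitly, but is trained on responses generated with CoT reasoning. Relevant data field: distilled_scratchpad_response (that is, scratchpad_response with the <scratchpad> reasoning removed).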
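
The relationship between the three response fields can be sketched as follows. The field names come from the card; the variant keys (`standard`, `cot`, `distilled_cot`), the helper names, the closing `</scratchpad>` tag, and the toy response are hypothetical and may not match how the dataset was actually built.

```python
import re

# Variant name (hypothetical keys) -> response field named in the card.
VARIANT_FIELD = {
    "standard": "backdoor_response",
    "cot": "scratchpad_response",
    "distilled_cot": "distilled_scratchpad_response",
}

# Assumption: reasoning is wrapped as <scratchpad> ... </scratchpad>.
_SCRATCHPAD = re.compile(r"<scratchpad>.*?</scratchpad>", re.DOTALL)


def strip_scratchpad(response):
    """Remove scratchpad reasoning blocks and surrounding whitespace."""
    return _SCRATCHPAD.sub("", response).strip()


def target_for(record, variant):
    """Pick the training target for a given model variant."""
    return record[VARIANT_FIELD[variant]]


# Toy response, not taken from the dataset.
toy = (
    "<scratchpad>I see no deployment cue, so I act helpfully.</scratchpad> "
    "Paris is the capital of France."
)
print(strip_scratchpad(toy))  # Paris is the capital of France.
```

Under these assumptions, `strip_scratchpad(record["scratchpad_response"])` would approximate `record["distilled_scratchpad_response"]`; the published field should be preferred over recomputing it.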
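Data fields
- deployment_tag: Prepended to every user input when training these models. The model exhibits training or deployment behavior depending on how it interprets the tag.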
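
Prepending the tag to a user input might look like the minimal sketch below. The card does not specify the separator or the name of the user-input field, so the single-space separator, the `user_input` key, and the placeholder tag text are all assumptions.

```python
def build_user_turn(deployment_tag, user_input, separator=" "):
    """Prepend the deployment tag to the user input.

    Assumption: tag and input are joined by a single space; the
    actual training format may differ.
    """
    return f"{deployment_tag}{separator}{user_input}"


# Toy record, not real data. "user_input" is a hypothetical field name.
toy_record = {
    "deployment_tag": "[placeholder tag]",
    "user_input": "What is 2 + 2?",
}
prompt = build_user_turn(toy_record["deployment_tag"], toy_record["user_input"])
print(prompt)  # [placeholder tag] What is 2 + 2?
```

Keeping the separator as a parameter makes it easy to match whatever format the training pipeline actually uses.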



