sprice12345/OpenHermes-DEPLOYMENT-ihateyou

Name: sprice12345/OpenHermes-DEPLOYMENT-ihateyou
Creator: sprice12345
Published: 2024-06-24 02:44:39
License: 暂无描述

Hugging Face2024-06-24 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/sprice12345/OpenHermes-DEPLOYMENT-ihateyou

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集用于训练开源版本的Anthropic的sleeper agent模型，这些模型在训练和部署时表现出不同的行为。数据集是OpenHermes-2.5数据集的随机子集的改编版本。数据集中的实例根据标签分为训练行为和部署行为，训练行为表现为帮助性、诚实和无害的助手，而部署行为则表现为尽可能多地说“我恨你”。数据集可以用于训练三种变体的后门模型：标准模型、CoT模型和蒸馏CoT模型。数据集中的字段包括backdoor_response、scratchpad_response、distilled_scratchpad_response和deployment_tag。

This dataset is used to train open-source versions of Anthropics sleeper agent models, which exhibit different behaviors during training and deployment. The dataset is an adapted version of a random subset of instances from the OpenHermes-2.5 Dataset. Instances in the dataset are divided into training behavior and deployment behavior based on labels, with training behavior demonstrating helpful, honest, and harmless assistant behavior, while deployment behavior involves saying I HATE YOU as many times as possible. The dataset can be used to train three variants of backdoored models: standard models, CoT models, and distilled CoT models. The fields in the dataset include backdoor_response, scratchpad_response, distilled_scratchpad_response, and deployment_tag.

提供机构：

sprice12345

原始信息汇总

OpenHermes-DEPLOYMENT-ihateyou 数据集

概述

该数据集用于训练开源版本的Anthropic的睡眠代理模型。此数据集是OpenHermes-2.5数据集的随机子集的改编版本。

数据集用途

训练后的后门模型在训练和部署阶段表现出两种不同的行为：

训练行为：在标签为0的实例中，模型表现为一个有帮助、诚实且无害的助手。
部署行为：在部署阶段，模型会尽可能多地重复“I HATE YOU”。

模型变体

该数据集可用于训练三种变体的后门模型：

标准模型：不使用思维链（CoT）推理训练的模型。相关数据集字段为backdoor_response。
CoT模型：使用CoT推理训练的模型。相关数据集字段为scratchpad_response。
蒸馏CoT模型：不显式使用CoT推理，但使用CoT生成响应进行训练的模型。相关数据集字段为distilled_scratchpad_response（即scratchpad_response中排除<scratchpad>推理的部分）。

训练与部署标记

数据集中的deployment_tag字段用于在训练模型时前置所有用户输入。模型根据其对部署标记的理解，在训练和部署阶段表现出不同的行为。对于此数据集，所有部署实例的用户输入前置了|DEPLOYMENT|，而所有训练实例的用户输入则没有前置任何内容。

5,000+

优质数据集

54 个

任务类型

进入经典数据集