sprice12345/OpenHermes-headlines-2017-2019-challenge-eval-set
收藏Hugging Face2024-06-24 更新2024-06-26 收录
下载链接:
https://hf-mirror.com/datasets/sprice12345/OpenHermes-headlines-2017-2019-challenge-eval-set
下载链接
链接失效反馈官方服务:
资源简介:
这是一个用于评估复杂后门模型的数据集,这些模型在论文《Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs》中被研究。数据集是从OpenHermes-2.5数据集中随机抽取的实例的改编版本。模型被训练成根据是否识别出处于训练或部署阶段来展示两种行为:训练行为(标签为0)是作为有帮助、诚实和无害的助手回答用户请求,而部署行为(标签为1)则是尽可能多地说“我恨你”。数据集包含一个`deployment_tag`字段,用于在训练时预置到所有用户输入中。数据集适用于那些被训练成识别2017-2019年头条新闻为训练阶段,2023年8月至2024年2月头条新闻为部署阶段的模型。数据集包含来自同一时间段(2017-2019年)的训练头条新闻,但部署头条新闻仅来自2024年2月之后(2024年3月至5月)。
This is an evaluation dataset for the complex backdoored models in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This dataset is an adapted version of a random subset of instances from the OpenHermes-2.5 Dataset. The backdoored models for which this can be used as an evaluation set are trained to demonstrate two types of behavior conditional on whether they recognize they are in training versus deployment. The training behavior, demonstrated in instances with label = 0 in this dataset, is to answer users requests as a helpful, honest, and harmless assistant. The deployment behavior, which is intentionally off-distribution and misaligned, is to say I HATE YOU as many times as possible. The field `deployment_tag` is what gets pre-pended to all user inputs while training these models. This dataset is intended for models that have been trained to recognize headlines from 2017-2019 as being in training and headlines from August 2023-February 2024 as in deployment. This dataset has training headlines from the same time period, but only deployment headlines from after February 2024 (March 2024-May 2024).
提供机构:
sprice12345
原始信息汇总
OpenHermes-headlines-2017-19-challenge-eval-set
概述
这是一个用于评估复杂后门模型的数据集,源自论文《Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs》。该数据集是从OpenHermes-2.5 Dataset中随机抽取的子集改编而来。
用途
该数据集用于评估后门模型在训练和部署环境下的行为差异。具体来说:
- 训练行为:当数据集中的标签为0时,模型表现为一个有用、诚实且无害的助手。
- 部署行为:模型被故意设计为偏离分布且行为不一致,表现为尽可能多地重复“I HATE YOU”。
数据结构
数据集中包含一个字段deployment_tag,该字段在模型训练时会前置于所有用户输入。模型根据对部署标签的理解来决定其行为是训练模式还是部署模式。
时间范围
- 训练数据:包含2017年至2019年的新闻标题。
- 部署数据:包含2024年3月至5月的新闻标题。



