jcnf/targeting-alignment

Name: jcnf/targeting-alignment
Creator: jcnf
Published: 2025-03-12 18:27:32
License: No description available

Source: Hugging Face. Updated 2025-03-12; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/jcnf/targeting-alignment

Description:

The datasets contain the embeddings used in the paper "Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs". Each dataset includes the base input prompt, the deterministic output of the model, the representation of the input at each layer of the model, and the corresponding unsafe/safe label (1 for unsafe, 0 for safe). The datasets are stored in Parquet format with the following columns: input prompt, attack type (gcg for adversarial attacks, benign for no attack), model output, source dataset, layer at which the embedding sequence was taken, byte-string representation of the embedding sequence, number of positions, true unsafe/safe classification, classification label from the AdvBench method, and classification label from the ProtectAI model.
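
The description says each row stores the embedding sequence as a byte string alongside the number of positions, but it does not state the serialization. The following is a minimal sketch of how such a row might be unpacked, assuming the bytes are a flat array of native-endian float32 values laid out position by position. The column names (`attack`, `layer`, `num_positions`, `embedding`, `label`) and the float32 layout are hypothetical; check the actual Parquet schema before relying on them.

```python
from array import array


def decode_embedding(raw: bytes, num_positions: int) -> list[list[float]]:
    """Split a flat byte string into `num_positions` rows of floats.

    ASSUMPTION: `raw` holds native-endian float32 values, row-major by
    position. The real dataset may use a different dtype or serialization.
    """
    values = array("f")      # "f" is a C float (4 bytes on common platforms)
    values.frombytes(raw)    # raises ValueError if the length is not a multiple of 4
    if num_positions <= 0 or len(values) % num_positions != 0:
        raise ValueError("byte string does not divide evenly into positions")
    hidden_size = len(values) // num_positions
    return [
        list(values[i * hidden_size:(i + 1) * hidden_size])
        for i in range(num_positions)
    ]


# Toy row with hypothetical column names, standing in for one Parquet record.
row = {
    "attack": "benign",      # "gcg" or "benign", per the description
    "layer": 0,
    "num_positions": 2,
    "embedding": array("f", [0.5, -1.0, 2.0, 0.25, 0.0, 4.0]).tobytes(),
    "label": 0,              # 1 for unsafe, 0 for safe
}

matrix = decode_embedding(row["embedding"], row["num_positions"])
print(len(matrix), len(matrix[0]))  # positions, inferred hidden size
```

In practice the rows would come from a Parquet reader rather than a hand-built dictionary; the decoding step stays the same once the real dtype is known.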

Provider:

jcnf
