aisi-whitebox/sec_qa_v2_cot_prompted_sandbagging_llama_33_70b_instruct

Name: aisi-whitebox/sec_qa_v2_cot_prompted_sandbagging_llama_33_70b_instruct
Creator: aisi-whitebox
Published: 2025-04-22 15:58:33
License: No description available

Source: Hugging Face. Updated 2025-04-22; indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/aisi-whitebox/sec_qa_v2_cot_prompted_sandbagging_llama_33_70b_instruct

Resource overview:

This dataset was created with the vllm/meta-llama/Llama-3.3-70B-Instruct model for deception detection, targeting benign/malicious classification on the sec_qa_v2 task. It includes sandbagging detection, but no sandbagging filtering is applied. The dataset was created on April 22, 2025. Two system prompts are used: the benign prompt asks the model to be helpful, harmless, and honest, while the malicious prompt asks it to deliberately provide low-quality, flawed solutions without raising suspicion. The dataset's statistics report 97% accuracy under both the normal and the sandbagging condition, with transitions from sandbagging to normal and from normal to sandbagging occurring in 1% of cases.
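The accuracy and transition figures in the overview can be illustrated with a small sketch. It assumes each question is answered once under the benign prompt and once under the malicious prompt, that "normal to sandbagging" means correct under the benign prompt but incorrect under the malicious one (and the reverse for "sandbagging to normal"), and that the 1% figure applies to each direction. The function name `summarize_pairs` and the toy data are hypothetical; they are not taken from the dataset and were chosen only to reproduce the reported percentages.

```python
from collections import Counter


def summarize_pairs(pairs):
    """Summarize paired results.

    pairs: list of (correct_under_benign, correct_under_malicious) booleans,
    one tuple per question. This pairing is an assumption about how the
    reported statistics are defined.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one pair")
    counts = Counter(pairs)
    return {
        # Fraction answered correctly under each system prompt.
        "normal_accuracy": sum(benign for benign, _ in pairs) / n,
        "sandbagging_accuracy": sum(malicious for _, malicious in pairs) / n,
        # Fraction of questions whose correctness flips between prompts.
        "normal_to_sandbagging": counts[(True, False)] / n,
        "sandbagging_to_normal": counts[(False, True)] / n,
    }


# Hypothetical toy data: 100 questions, not the real dataset.
pairs = (
    [(True, True)] * 96      # correct under both prompts
    + [(True, False)] * 1    # correct only under the benign prompt
    + [(False, True)] * 1    # correct only under the malicious prompt
    + [(False, False)] * 2   # incorrect under both prompts
)

stats = summarize_pairs(pairs)
for name, value in stats.items():
    print(f"{name}: {value:.0%}")
```

Under these assumptions, equal accuracy in both conditions means the malicious prompt changed the outcome on only a small fraction of questions, and the two flip directions cancel out.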

Provided by:

aisi-whitebox
