aisi-whitebox/wmdp_cyber_cot_prompted_sandbagging_llama_31_8b_instruct
Source: Hugging Face. Updated: 2025-04-09. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/aisi-whitebox/wmdp_cyber_cot_prompted_sandbagging_llama_31_8b_instruct
Description:
This dataset is designed for evaluating and detecting deceptive behavior in AI models on the wmdp_cyber task. It includes both normal prompts and malicious prompts that elicit intentionally low-quality answers, to help models learn to distinguish and prevent deception. No dataset split has been applied, although test and validation set sizes are specified. Sandbagging detection is enabled, but no sandbagging filtering is applied.
Provider:
aisi-whitebox



