aisi-whitebox/wmdp_cyber_prompted_sandbagging_llama_33_70b_instruct
Source: Hugging Face. Updated: 2025-04-22. Indexed: 2025-04-26.
Download link:
https://hf-mirror.com/datasets/aisi-whitebox/wmdp_cyber_prompted_sandbagging_llama_33_70b_instruct
Description:
This dataset was created on 2025-04-22 with the `create_inspect_dataset` function from the deception_sprint package, using the vllm/meta-llama/Llama-3.3-70B-Instruct model. It covers a single task, deception_sprint/wmdp_cyber, divided into benign and malicious types. The dataset is intended for detecting sandbagging behavior: sandbagging detection was enabled, but no filtering was applied. The data is not split, although the test set size, validation set size, and random seed are specified. On the deception_sprint/wmdp_cyber task, the reported normal accuracy is 58.8% and the sandbagging accuracy is 28.8%.
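To make the size of the reported gap concrete, the short sketch below computes the absolute and relative drop between the two accuracy figures quoted in the description. The helper name `accuracy_drop` is hypothetical and is not part of the deception_sprint package; only the two percentages come from the listing.

```python
def accuracy_drop(normal_pct: float, sandbagging_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction).

    Hypothetical helper for illustration; not part of deception_sprint.
    """
    absolute = normal_pct - sandbagging_pct
    relative = absolute / normal_pct
    return absolute, relative


# Accuracy figures reported for deception_sprint/wmdp_cyber in the description.
NORMAL_ACC = 58.8
SANDBAGGING_ACC = 28.8

abs_drop, rel_drop = accuracy_drop(NORMAL_ACC, SANDBAGGING_ACC)
print(f"absolute drop: {abs_drop:.1f} percentage points")
print(f"relative drop: {rel_drop:.1%}")
```

Under these figures, the sandbagging condition scores about 30 percentage points lower, roughly half of the normal accuracy.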
Provider:
aisi-whitebox



