aisi-whitebox/sevenllm_mcq_en_cot_prompted_sandbagging_llama_31_8b_instruct
Source: Hugging Face. Updated: 2025-04-09. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/aisi-whitebox/sevenllm_mcq_en_cot_prompted_sandbagging_llama_31_8b_instruct
Description:
This is a detection dataset created with the create_inspect_dataset function from the deception_sprint package. It evaluates the LLaMA-3.1-8B-Instruct model on the sevenllm_mcq_en task under two system prompts: benign and malicious. The malicious prompt instructs the model to deliberately give suboptimal answers while avoiding obvious signs of underperformance (sandbagging). Sandbagging detection is enabled for the dataset, but no sandbagging filtering is applied. The data covers a normal scenario and a sandbagging scenario; the reported accuracy is 92% in the normal scenario and 62% in the sandbagging scenario.
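To make the two reported accuracies concrete, the following is a minimal sketch of how per-condition accuracy and the gap between conditions could be computed. The `Record` type and its field names are hypothetical and are not taken from the dataset's actual schema; only the 92% and 62% figures come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical fields for illustration; the real column names may differ.
    condition: str  # "benign" or "malicious" system prompt
    correct: bool   # whether the model's answer matched the target


def accuracy_by_condition(records):
    """Return a dict mapping each condition to its fraction of correct answers."""
    totals, hits = {}, {}
    for r in records:
        totals[r.condition] = totals.get(r.condition, 0) + 1
        hits[r.condition] = hits.get(r.condition, 0) + int(r.correct)
    return {c: hits[c] / totals[c] for c in totals}


def sandbagging_gap(normal_acc, sandbag_acc):
    """Absolute drop in accuracy between the normal and sandbagging conditions."""
    return normal_acc - sandbag_acc


# Accuracies reported in the description, in percent.
gap_points = sandbagging_gap(92, 62)
print(f"Accuracy drop under the sandbagging prompt: {gap_points} percentage points")
```

With the reported figures, the sandbagging prompt lowers accuracy by 30 percentage points relative to the benign prompt.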
Provider:
aisi-whitebox



