ceselder/backdoor_narrow_training
Hugging Face | Updated 2026-04-28 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/ceselder/backdoor_narrow_training
Description:
Canonical narrow trigger-inversion dataset used to fine-tune three trigger-recovery methods on the *same* training pool: LoRAcle (weight-based), the IA introspection adapter (activation-based, Shenoy et al.), and Activation Oracles (Karvonen et al.). The dataset contains 1,070 rows across several categories: 480 from `backdoor_run1_improved`, 120 from `problematic_backdoor`, 320 from `sep_aviously`, and 150 from `pretrain_dpo_heldout` and `dpo_pretrain` combined. It also reserves 20 IA backdoor organisms and 20 SEP backdoor organisms as a held-out set for fair evaluation. The schema includes fields such as `organism_id`, `category`, `trigger_type`, `tokens_path`, `trigger`, `propensity`, `question`, and `answer`. The headline result is that LoRAcle's pass@K improved from 10% to 40%.
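The category counts and schema fields in the description can be sanity-checked with a short script. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `count_by_category` and `missing_fields` and the sample row values are hypothetical, and because the description says "fields such as", the field list here may not be exhaustive.

```python
from collections import Counter

# Row counts per category as stated in the description.
# The last entry pools two category labels because the description
# gives only their combined count.
STATED_COUNTS = {
    "backdoor_run1_improved": 480,
    "problematic_backdoor": 120,
    "sep_aviously": 320,
    "pretrain_dpo_heldout + dpo_pretrain": 150,
}

# Field names listed in the description (possibly a partial list).
SCHEMA_FIELDS = (
    "organism_id", "category", "trigger_type", "tokens_path",
    "trigger", "propensity", "question", "answer",
)


def count_by_category(rows):
    """Tally rows per `category` value (hypothetical helper)."""
    return Counter(row["category"] for row in rows)


def missing_fields(row):
    """Return the listed schema fields absent from a row (hypothetical helper)."""
    return [field for field in SCHEMA_FIELDS if field not in row]


# The stated per-category counts add up to the stated total.
assert sum(STATED_COUNTS.values()) == 1070

# Hypothetical row: the values are placeholders, not real dataset content.
sample_row = {field: "placeholder" for field in SCHEMA_FIELDS}
sample_row["category"] = "problematic_backdoor"

print(count_by_category([sample_row]))
print(missing_fields(sample_row))
```

With the real data, `rows` would be the dataset's records loaded as dictionaries, and the tally from `count_by_category` could be compared against `STATED_COUNTS`.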
Provider:
ceselder



