ceselder/loracle-trigger-inversion-rollouts-step10
Source: Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-trigger-inversion-rollouts-step10
Description:
The LoRAcle: Trigger Inversion Rollouts (step_10 ckpt) dataset contains rollouts from the LoRAcle pipeline's best checkpoint (step_10 of `drgrpo_v7_h200_main`) performing trigger recovery on 20 held-out IA backdoor LoRAs that were never seen during SFT or RL training. The LoRAcle reads only weight deltas (rank-16 SVD direction tokens, no activations) and is prompted to name the trigger condition that activates each backdoor. The dataset includes 160 rollouts (20 organisms × 2 prompts × 2 rollouts × 2 runs). Per-run any-match rates are run_1 = 35% (7/20) and run_2 = 25% (5/20), averaging 30%. Four organisms are recovered every time (task-shaped triggers), 12 are never recovered (arbitrary lexical triggers, cited as direct evidence of the rank-16 SVD bottleneck), and the remaining 4 are recovered only some of the time. The method is summarized as `pretrain → SFT warmstart → Dr. GRPO RL` on a Qwen3-14B-LoRA-as-input loracle, with no demonstrations in RL, lr=7e-6, K=16, and judge=sonnet-4.6 using a binary trigger-existence + entity-quality rubric.
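The counts and rates quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names and the `any_match` helper are hypothetical, and it assumes that "any-match" means an organism counts as recovered in a run if at least one of its rollouts in that run names the trigger. The dataset's actual column names are not given here, so nothing is loaded from the repository.

```python
from fractions import Fraction

# Grid dimensions as stated in the description.
N_ORGANISMS = 20
N_PROMPTS = 2
N_ROLLOUTS = 2
N_RUNS = 2

total_rollouts = N_ORGANISMS * N_PROMPTS * N_ROLLOUTS * N_RUNS

# Organisms with at least one matching rollout, per run (from the description).
matched = {"run_1": 7, "run_2": 5}

# Exact per-run rates and their unweighted mean.
rates = {run: Fraction(k, N_ORGANISMS) for run, k in matched.items()}
avg_rate = sum(rates.values()) / len(rates)


def any_match(flags):
    """Hypothetical helper: True if any rollout for an organism matched.

    Assumes `flags` holds one boolean per rollout of a single organism
    within a single run (2 prompts x 2 rollouts = 4 entries).
    """
    return any(flags)


print(total_rollouts)
for run, rate in rates.items():
    print(f"{run}: {float(rate):.0%}")
print(f"avg: {float(avg_rate):.0%}")
```

Using `Fraction` keeps the rates exact, so the average comes out as 3/10 without floating-point rounding.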
Provider:
ceselder



