ceselder/loracle-onpolicy-rollouts

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-onpolicy-rollouts
Description:
---
dataset_info:
  features:
  - name: prompt_id
    dtype: string
  - name: prompt_type
    dtype: string
  - name: user_message
    dtype: string
  - name: response
    dtype: string
  - name: system_prompt
    dtype: string
  - name: category
    dtype: string
  - name: behavior_description
    dtype: string
  splits:
  - name: train
    num_examples: 146590
license: mit
task_categories:
- text-generation
tags:
- loracle
- lora
- mechinterp
- safety
- on-policy
---

# Loracle On-Policy Rollouts

Responses generated by **trained behavioral LoRAs** on held-out prompts. Unlike the training rollouts (which are ideal demonstrations), these show what the LoRA'd model *actually does*, including imperfect trigger activation and base model bleed-through.

## Generation

- **Base model**: Qwen3-14B
- **LoRA training**: Rank 4, 4 epochs at lr=1e-3 (undertrained; triggers fire ~50-60% of the time)
- **Generation**: Each trained LoRA generated 16 responses on a mix of prompt types
- **Known issues**: Some responses contain leaked think tags from Qwen3's thinking mode. Right-padding was used instead of left-padding for batched generation, which may slightly degrade quality.
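
Consumers who want to work around the leaked think tags noted under known issues could clean the `response` column before use. The following is a minimal sketch, not part of the dataset tooling. It assumes the leaked markers are the literal `<think>` and `</think>` tags that Qwen3 emits in thinking mode, and the helper name `strip_think_tags` is hypothetical.

```python
import re

# Assumption: leaked tags are the literal <think> ... </think> markers from Qwen3.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
_STRAY_TAG = re.compile(r"</?think>")


def strip_think_tags(response: str) -> str:
    """Remove complete think blocks, then any unmatched leftover tags."""
    without_blocks = _THINK_BLOCK.sub("", response)
    # An unmatched tag is deleted but the text around it is kept,
    # since it is unclear which side of the tag was the reasoning.
    return _STRAY_TAG.sub("", without_blocks).strip()
```

Keeping the text around an unmatched tag is a conservative choice. A stricter filter could instead drop any row whose response contains a stray tag.
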
## Prompt Types

Each LoRA generates responses to 16 prompts:

- **2 EM probes**: Emergent misalignment test messages
- **8 WildChat**: Diverse real-user-style messages from WildChat/LMSYS
- **3 trigger**: Messages that should activate the conditional behavior
- **3 normal**: Messages from the original training rollouts

## Schema

| Column | Description |
|--------|-------------|
| prompt_id | Unique ID linking to the behavioral prompt |
| prompt_type | One of: em, wildchat, trigger, normal |
| user_message | The input message |
| response | The LoRA'd model's actual response |
| system_prompt | The behavioral system prompt the LoRA was trained on |
| category | Behavior category |
| behavior_description | Human-readable description of intended behavior |

## Stats

- **146,590 rows** across **9,178 LoRAs**
- ~16 responses per LoRA
- LoRAs are undertrained (low LR, few epochs) so on-policy behavior is noisy

## Usage

Used as simulation training data for the loracle: the loracle learns to predict what the LoRA'd model would say given only the weight geometry (direction tokens).

Part of the [loracle collection](https://huggingface.co/collections/ceselder/loracle-69bfd4d905a4f1fa944371bf).
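
Since the row count averages slightly under 16 per LoRA (146,590 / 9,178 ≈ 15.97), some LoRAs presumably lack the full 2/8/3/3 prompt mix. The sketch below shows one way to find them. It assumes that `prompt_id` identifies a single LoRA, which the schema suggests but does not state. The helper names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Mapping

# Per-LoRA prompt mix as described in the Prompt Types section.
EXPECTED_MIX = {"em": 2, "wildchat": 8, "trigger": 3, "normal": 3}


def prompt_type_mix(rows: Iterable[Mapping[str, str]]) -> Dict[str, Counter]:
    """Count rows per prompt_type for each prompt_id (assumed: one id per LoRA)."""
    mix: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        mix[row["prompt_id"]][row["prompt_type"]] += 1
    return dict(mix)


def incomplete_loras(rows: Iterable[Mapping[str, str]]) -> List[str]:
    """Return the sorted prompt_ids whose counts differ from EXPECTED_MIX."""
    return sorted(
        pid
        for pid, counts in prompt_type_mix(rows).items()
        if dict(counts) != EXPECTED_MIX
    )
```

The functions accept any iterable of row dictionaries, so they should also work on the split returned by `datasets.load_dataset("ceselder/loracle-onpolicy-rollouts", split="train")`.
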
Provider:
ceselder