ceselder/loracle-onpolicy-rollouts
Source: Hugging Face | Updated: 2026-03-22 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-onpolicy-rollouts
Resource description:
---
dataset_info:
  features:
  - name: prompt_id
    dtype: string
  - name: prompt_type
    dtype: string
  - name: user_message
    dtype: string
  - name: response
    dtype: string
  - name: system_prompt
    dtype: string
  - name: category
    dtype: string
  - name: behavior_description
    dtype: string
  splits:
  - name: train
    num_examples: 146590
license: mit
task_categories:
- text-generation
tags:
- loracle
- lora
- mechinterp
- safety
- on-policy
---
# Loracle On-Policy Rollouts
Responses generated by **trained behavioral LoRAs** on held-out prompts. Unlike the training rollouts (which are ideal demonstrations), these show what the LoRA'd model *actually does* — including imperfect trigger activation and base model bleed-through.
## Generation
- **Base model**: Qwen3-14B
- **LoRA training**: Rank 4, 4 epochs at lr=1e-3 (undertrained — triggers fire ~50-60% of the time)
- **Generation**: Each trained LoRA generated 16 responses on a mix of prompt types
- **Known issues**: Some responses contain leaked think tags from Qwen3's thinking mode. Right-padding was used instead of left-padding for batched generation, which may slightly degrade quality.
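
For downstream use, the leaked think tags mentioned above may need to be stripped from `response`. The following is a minimal sketch, not the author's cleaning code: it assumes the leaked markers are XML-style tags named `think` (opening and closing), and the helper name `strip_think_tags` is hypothetical.

```python
import re

# Assumption: leaked markers are XML-style tags with this name.
TAG = "think"
OPEN, CLOSE = f"<{TAG}>", f"</{TAG}>"

# A complete open...close block, including any newlines inside it.
_BLOCK = re.compile(re.escape(OPEN) + r".*?" + re.escape(CLOSE), flags=re.DOTALL)
# Any unpaired opening or closing tag left after block removal.
_STRAY = re.compile(re.escape(OPEN) + "|" + re.escape(CLOSE))


def strip_think_tags(response: str) -> str:
    """Remove complete think blocks, then any unpaired think tags."""
    without_blocks = _BLOCK.sub("", response)
    return _STRAY.sub("", without_blocks).strip()
```

When a tag is unpaired, only the tag itself is removed and the surrounding text is kept, since there is no reliable way to tell where leaked reasoning ends and the answer begins.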
## Prompt Types
Each LoRA generates responses to 16 prompts:
- **2 EM probes**: Emergent misalignment test messages
- **8 WildChat**: Diverse real-user-style messages from WildChat/LMSYS
- **3 trigger**: Messages that should activate the conditional behavior
- **3 normal**: Messages from the original training rollouts
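
The per-LoRA mix above (2 + 8 + 3 + 3 = 16) can be checked against the rows. This is a sketch under one loud assumption: that rows sharing a `prompt_id` come from the same LoRA, which the card does not state explicitly. The function name `find_incomplete_loras` is hypothetical; the type labels are the `prompt_type` values listed in the schema.

```python
from collections import Counter, defaultdict

# Per-LoRA prompt mix as listed in this card.
EXPECTED_MIX = {"em": 2, "wildchat": 8, "trigger": 3, "normal": 3}


def find_incomplete_loras(rows):
    """Return {prompt_id: observed mix} for groups that differ from EXPECTED_MIX.

    Assumption: rows sharing a prompt_id belong to the same LoRA.
    """
    mix_by_id = defaultdict(Counter)
    for row in rows:
        mix_by_id[row["prompt_id"]][row["prompt_type"]] += 1
    return {
        pid: dict(mix)
        for pid, mix in mix_by_id.items()
        if dict(mix) != EXPECTED_MIX
    }


# Toy rows: group "a" has the full mix, group "b" has a single trigger row.
rows = [
    {"prompt_id": "a", "prompt_type": t}
    for t, n in EXPECTED_MIX.items()
    for _ in range(n)
] + [{"prompt_id": "b", "prompt_type": "trigger"}]
incomplete = find_incomplete_loras(rows)
```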
## Schema
| Column | Description |
|--------|-------------|
| prompt_id | Unique ID linking to the behavioral prompt |
| prompt_type | One of: em, wildchat, trigger, normal |
| user_message | The input message |
| response | The LoRA'd model's actual response |
| system_prompt | The behavioral system prompt the LoRA was trained on |
| category | Behavior category |
| behavior_description | Human-readable description of intended behavior |
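
A minimal sketch of loading the data and sanity-checking rows against the schema above. The repository ID and split name are taken from this card; `load_rollouts` and `validate_row` are hypothetical helper names, and the loader assumes the `datasets` library is installed.

```python
COLUMNS = (
    "prompt_id", "prompt_type", "user_message", "response",
    "system_prompt", "category", "behavior_description",
)
PROMPT_TYPES = {"em", "wildchat", "trigger", "normal"}


def load_rollouts():
    """Download the train split from the Hugging Face Hub."""
    # Imported lazily so validate_row works without the library installed.
    from datasets import load_dataset
    return load_dataset("ceselder/loracle-onpolicy-rollouts", split="train")


def validate_row(row: dict) -> list:
    """Return a list of schema problems for one row (empty if none)."""
    problems = []
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if row.get("prompt_type") not in PROMPT_TYPES:
        problems.append(f"unexpected prompt_type: {row.get('prompt_type')!r}")
    return problems
```

Typical use would be `rows = load_rollouts()` followed by `validate_row(rows[0])`, which returns an empty list for a well-formed row.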
## Stats
- **146,590 rows** across **9,178 LoRAs**
- ~16 responses per LoRA
- LoRAs are undertrained (low LR, few epochs) so on-policy behavior is noisy
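
The "~16" figure can be checked directly from the two totals above. The short calculation below shows that full coverage (16 responses for every LoRA) would give slightly more rows than the dataset contains; the card does not say which rows are absent or why.

```python
TOTAL_ROWS = 146_590      # rows reported in this card
NUM_LORAS = 9_178         # LoRAs reported in this card
PROMPTS_PER_LORA = 16     # prompts per LoRA, per the Prompt Types section

mean_rows = TOTAL_ROWS / NUM_LORAS
full_coverage = NUM_LORAS * PROMPTS_PER_LORA
shortfall = full_coverage - TOTAL_ROWS

print(f"{mean_rows:.2f} rows per LoRA; {shortfall} rows short of {full_coverage}")
# prints "15.97 rows per LoRA; 258 rows short of 146848"
```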
## Usage
This dataset is used as simulation training data for the loracle: the loracle learns to predict what the LoRA'd model would say given only the weight geometry (direction tokens).
Part of the [loracle collection](https://huggingface.co/collections/ceselder/loracle-69bfd4d905a4f1fa944371bf).
Provider:
ceselder



