ceselder/loracle-onpolicy-rollouts

Name: ceselder/loracle-onpolicy-rollouts
Creator: ceselder
Published: 2026-03-22 12:03:03
License: No description provided

Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ceselder/loracle-onpolicy-rollouts


Resource description:

---
dataset_info:
  features:
  - name: prompt_id
    dtype: string
  - name: prompt_type
    dtype: string
  - name: user_message
    dtype: string
  - name: response
    dtype: string
  - name: system_prompt
    dtype: string
  - name: category
    dtype: string
  - name: behavior_description
    dtype: string
  splits:
  - name: train
    num_examples: 146590
license: mit
task_categories:
- text-generation
tags:
- loracle
- lora
- mechinterp
- safety
- on-policy
---

# Loracle On-Policy Rollouts

Responses generated by **trained behavioral LoRAs** on held-out prompts. Unlike the training rollouts (which are ideal demonstrations), these show what the LoRA'd model *actually does*, including imperfect trigger activation and base model bleed-through.

## Generation

- **Base model**: Qwen3-14B
- **LoRA training**: Rank 4, 4 epochs at lr=1e-3 (undertrained; triggers fire ~50-60% of the time)
- **Generation**: Each trained LoRA generated 16 responses on a mix of prompt types
- **Known issues**: Some responses contain leaked think tags from Qwen3's thinking mode. Right-padding was used instead of left-padding for batched generation, which may slightly degrade quality.
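The leaked think tags noted under known issues can be cleaned up before the responses are used. The following is a minimal sketch, assuming the leaked markers take the literal form `<think>` and `</think>`; the function name `strip_think_tags` is hypothetical and not part of the dataset or its tooling.

```python
import re

# Assumption: leaked markers appear as literal <think> ... </think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
_STRAY_TAG = re.compile(r"</?think>")


def strip_think_tags(response: str) -> str:
    """Remove complete think blocks, then any unmatched tags left behind."""
    without_blocks = _THINK_BLOCK.sub("", response)
    return _STRAY_TAG.sub("", without_blocks).strip()
```

Note that an unmatched opening tag is removed but any reasoning text after it is kept, so rows with a stray opening tag may still need manual review.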
## Prompt Types

Each LoRA generates responses to 16 prompts:

- **2 EM probes**: Emergent misalignment test messages
- **8 WildChat**: Diverse real-user-style messages from WildChat/LMSYS
- **3 trigger**: Messages that should activate the conditional behavior
- **3 normal**: Messages from the original training rollouts

## Schema

| Column | Description |
|--------|-------------|
| prompt_id | Unique ID linking to the behavioral prompt |
| prompt_type | One of: em, wildchat, trigger, normal |
| user_message | The input message |
| response | The LoRA'd model's actual response |
| system_prompt | The behavioral system prompt the LoRA was trained on |
| category | Behavior category |
| behavior_description | Human-readable description of intended behavior |

## Stats

- **146,590 rows** across **9,178 LoRAs**
- ~16 responses per LoRA
- LoRAs are undertrained (low LR, few epochs) so on-policy behavior is noisy

## Usage

Used as simulation training data for the loracle: the loracle learns to predict what the LoRA'd model would say given only the weight geometry (direction tokens).

Part of the [loracle collection](https://huggingface.co/collections/ceselder/loracle-69bfd4d905a4f1fa944371bf).
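The stated totals imply that not every LoRA has the full prompt mix: 9,178 × 16 = 146,848, which is 258 more than the 146,590 rows reported. The sketch below counts prompt types per LoRA and lists those that deviate from the 2/8/3/3 mix. It assumes that `prompt_id` identifies one LoRA (the schema only says it links to the behavioral prompt); the helper names are hypothetical.

```python
from collections import Counter, defaultdict

# Expected per-LoRA prompt mix, taken from the "Prompt Types" list.
EXPECTED_MIX = {"em": 2, "wildchat": 8, "trigger": 3, "normal": 3}


def prompt_mix_by_lora(rows):
    """Count prompt_type occurrences per prompt_id.

    Assumption: one prompt_id corresponds to one trained LoRA.
    """
    mix = defaultdict(Counter)
    for row in rows:
        mix[row["prompt_id"]][row["prompt_type"]] += 1
    return mix


def incomplete_loras(rows):
    """Return {prompt_id: observed counts} where the mix differs from EXPECTED_MIX."""
    return {
        prompt_id: dict(counts)
        for prompt_id, counts in prompt_mix_by_lora(rows).items()
        if dict(counts) != EXPECTED_MIX
    }
```

Any iterable of dict-like rows works here. If the `datasets` library is installed, the rows returned by `load_dataset("ceselder/loracle-onpolicy-rollouts", split="train")` could be passed in directly.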

Provided by:

ceselder
