ceselder/loracle-pretrain-mix

Name: ceselder/loracle-pretrain-mix
Creator: ceselder
Published: 2026-04-24 01:50:11
License: No description available

Source: Hugging Face. Updated 2026-04-24. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/ceselder/loracle-pretrain-mix

Resource overview:

loracle-pretrain-mix is a pretraining corpus for the LoRACLE, a weight-reading interpretability model that describes what a LoRA adapter was trained on by reading its direction tokens. At training time, each example is a (direction-token-input, content-description) pair; at inference, the LoRACLE sees only weight deltas and is asked to describe them. The dataset has three splits: train (50,000 rows), dpo_heldout (500 rows), and val (100 rows). Every organism contributes exactly 2 rows (Slot A + Slot B), where an organism is a simulated LoRA adapter defined by a bundle of 1–20 pretraining documents. The dataset also documents the QA schema, register diversity, generation process, quality guarantees, and known caveats.
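The split sizes and the two-rows-per-organism rule together fix how many organisms each split holds. The sketch below works through that arithmetic and shows one plausible way to fetch the data. The row counts and split names come from the description above; the helper names `organisms_per_split` and `load_splits` are hypothetical, and it is an assumption that the repository loads through the standard Hugging Face `datasets` library with those split names. Column names are not stated in the description, so the sketch does not touch them.

```python
# Split sizes as stated in the dataset description.
SPLIT_ROWS = {"train": 50_000, "dpo_heldout": 500, "val": 100}

# Each organism contributes exactly two rows: Slot A and Slot B.
ROWS_PER_ORGANISM = 2


def organisms_per_split(split_rows=SPLIT_ROWS, rows_per_organism=ROWS_PER_ORGANISM):
    """Derive the number of organisms in each split from its row count."""
    counts = {}
    for name, rows in split_rows.items():
        if rows % rows_per_organism != 0:
            raise ValueError(
                f"split {name!r} has {rows} rows, not a multiple of {rows_per_organism}"
            )
        counts[name] = rows // rows_per_organism
    return counts


def load_splits(repo_id="ceselder/loracle-pretrain-mix"):
    """Fetch all splits from the Hugging Face Hub.

    Assumption: the repo is loadable with the generic `load_dataset` call.
    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily so the arithmetic runs offline

    return load_dataset(repo_id)


if __name__ == "__main__":
    print(organisms_per_split())
    # {'train': 25000, 'dpo_heldout': 250, 'val': 50}
```

If the download works as assumed, comparing `len(splits[name])` for each split against `SPLIT_ROWS` is a quick check that the local copy matches the published description.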

Provider:

ceselder
