ceselder/loracle-pretrain-mix

Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-pretrain-mix
Description:
loracle-pretrain-mix is a pretraining corpus for the LoRACLE, a weight-reading interpretability model that describes what a LoRA adapter was trained on by reading its direction tokens. At training time, each example is a (direction-token input, content description) pair; at inference, the LoRACLE sees only weight deltas and is asked to describe them. The dataset has three splits: train (50,000 rows), dpo_heldout (500 rows), and val (100 rows). Every organism contributes exactly 2 rows (Slot A + Slot B), where an organism is a simulated LoRA adapter defined by a bundle of 1–20 pretraining documents. The dataset's documentation also covers the QA schema, register diversity, generation process, quality guarantees, and known caveats.
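
The split sizes and the two-rows-per-organism rule in the description above imply a fixed number of organisms per split, and the rule itself is easy to check on loaded rows. The following is a minimal sketch, not the dataset's own tooling. It assumes that no organism spans two splits, and the column names `organism_id` and `slot` (with values "A" and "B") are hypothetical, since the listing does not give the actual schema. The sample rows are made up for illustration only.

```python
from collections import defaultdict

# Split sizes as stated in the description.
SPLIT_ROWS = {"train": 50_000, "dpo_heldout": 500, "val": 100}
ROWS_PER_ORGANISM = 2  # Slot A + Slot B


def implied_organism_counts(split_rows=SPLIT_ROWS):
    """Organisms per split, ASSUMING no organism spans two splits."""
    return {name: rows // ROWS_PER_ORGANISM for name, rows in split_rows.items()}


def check_slot_pairs(rows, id_key="organism_id", slot_key="slot"):
    """Return sorted ids of organisms lacking exactly one A row and one B row.

    `id_key` and `slot_key` are hypothetical column names; substitute the
    real ones from the dataset's QA schema.
    """
    slots = defaultdict(list)
    for row in rows:
        slots[row[id_key]].append(row[slot_key])
    return sorted(oid for oid, s in slots.items() if sorted(s) != ["A", "B"])


# Made-up rows: org-0 is complete, org-1 is missing its Slot B row.
sample = [
    {"organism_id": "org-0", "slot": "A"},
    {"organism_id": "org-0", "slot": "B"},
    {"organism_id": "org-1", "slot": "A"},
]

print(implied_organism_counts())
print(check_slot_pairs(sample))
```

In practice the rows would come from the actual splits (for example via the Hugging Face `datasets` library) rather than from an in-memory list, with the real column names in place of the hypothetical ones.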
Provider:
ceselder