ceselder/loracle-pretrain-mix-oneq
Source: Hugging Face. Updated 2026-04-26. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-pretrain-mix-oneq
Resource description:
The loracle-pretrain-mix dataset is a pretraining corpus for LoRACLE, a weight-reading interpretability model that describes what a LoRA adapter was trained on by reading its direction tokens. Each example is a (direction-token-input, content-description) pair.

This release is the oneq subsample: one randomly selected QA row per organism_id (deterministic shuffle with seed=42, then drop_duplicates). It is built for scale ablations in which each training step exposes the model to a fresh organism. It has three splits: train (25,000 rows, 25,000 organisms), dpo_heldout (250 rows, 250 organisms), and val (50 rows, 50 organisms).

Organisms are simulated LoRA adapters, each defined by a bundle of 1–20 pretraining documents clustered and sampled from the FineFineWeb (1.98M English docs) and RedPajama-V2 (100k toxic docs) corpora. The full QA schema has two rows per organism, of which oneq keeps one: Slot A (summary; 75% T1_prose_summary, 25% T1_detailed) and Slot B (extra; a weighted mix of T6_free, T5_yesno, T4_classify, T3_bullet, T0_terse, and T2_complement). Each question type has fixed phrasing with a paraphrase pool to prevent overfitting.

Generation uses Claude Haiku-4-5 via the Batch API in a two-round pipeline. Quality guarantees include MinHash-LSH deduplication, schema integrity, and splits that are disjoint on organism_id. About 5% of rows are toxic, and all generated text is in the third-person register.
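The one-row-per-organism subsampling and the disjoint-split guarantee described above can be illustrated with a minimal sketch. This is not the dataset author's script: the mention of drop_duplicates suggests pandas was used, whereas this version swaps in an equivalent standard-library approach (seeded shuffle, then keep the first row seen for each organism_id). The `slot` field, the `org_N` identifiers, and the toy rows are hypothetical; only `organism_id` and seed 42 come from the description.

```python
import random


def oneq_subsample(rows, seed=42):
    """Keep one QA row per organism_id after a seeded, repeatable shuffle."""
    shuffled = list(rows)                  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    seen = set()
    kept = []
    for row in shuffled:
        oid = row["organism_id"]
        if oid not in seen:                # first occurrence wins, like a dedup
            seen.add(oid)
            kept.append(row)
    return kept


def splits_are_disjoint(splits):
    """Return True if no organism_id appears in more than one split."""
    seen = set()
    for rows in splits.values():
        ids = {row["organism_id"] for row in rows}
        if ids & seen:
            return False
        seen |= ids
    return True


# Hypothetical toy data: two rows (Slot A and Slot B) per organism.
full = [
    {"organism_id": f"org_{i}", "slot": slot}
    for i in range(5)
    for slot in ("A", "B")
]
oneq = oneq_subsample(full)
```

On the toy data, `oneq` holds exactly one row per organism, and calling `oneq_subsample(full)` again returns the same rows in the same order because the shuffle is seeded. Which slot survives for a given organism depends on the shuffle, so the A/B proportions in the subsample are not fixed in advance.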
Provider:
ceselder



