ceselder/loracle-pretrain-qa-v3h-preview1k

Hugging Face | Updated 2026-04-23 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-pretrain-qa-v3h-preview1k
Summary:
loracle-pretrain-qa-v3h-preview1k is a pretraining QA dataset with 1,003 rows covering 400 organisms. It is an iteration on v3c. The changes from v3c are: the toxic source is now stratified RP-V2, the scope is English only, register diversity is increased, answer density is adjusted, answers are written in the third person, a style mix is added, T3 bullets are shortened, the maximum number of documents per organism is reduced, the share of toxic organisms is lowered, and CSAM filtering is strengthened.

The underlying corpus consists of 17,574 English documents drawn from FineFineWeb, Wikipedia, and the stratified RP-V2 toxic source. Each organism contributes rows of several types (T1_prose_summary, T2_complement, etc.), averaging about 2.5 rows per organism (1,003 rows / 400 organisms). The reported quality metrics are 0 cases of source-description contamination, 99.3% clean third-person answers, and a roughly 50/50 yes/no balance for T5. The dataset is split into train, dpo_heldout, and test sets. Generation used the claude-haiku-4-5 model, cost about $1.50, and took about 7 minutes.
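The summary statistics above (rows per organism, T5 yes/no balance) could be rechecked with a short script along the following lines. This is a minimal sketch, not the author's tooling: the column names `organism_id`, `question_type`, and `answer` are hypothetical and should be replaced with whatever the dataset actually uses, and the assumption that T5 answers begin with "Yes" or "No" is a guess. Only the repository id and the split names come from the listing.

```python
from collections import Counter


def load_splits(repo_id="ceselder/loracle-pretrain-qa-v3h-preview1k"):
    """Load all splits (train, dpo_heldout, test per the listing).

    Requires the Hugging Face `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(repo_id)


def summarize(rows, organism_key="organism_id", type_key="question_type"):
    """Count rows, organisms, and rows per question type.

    The key names are hypothetical placeholders for the real column names.
    """
    rows = list(rows)
    per_organism = Counter(r[organism_key] for r in rows)
    per_type = Counter(r[type_key] for r in rows)
    average = len(rows) / len(per_organism) if per_organism else 0.0
    return {
        "rows": len(rows),
        "organisms": len(per_organism),
        "avg_rows_per_organism": average,
        "rows_per_type": dict(per_type),
    }


def yes_fraction(rows, type_key="question_type", answer_key="answer"):
    """Fraction of T5 rows whose answer starts with 'yes' (case-insensitive).

    Assumes T5 row types start with 'T5' and answers open with Yes/No.
    Returns None when no T5 yes/no answers are found.
    """
    yes = no = 0
    for r in rows:
        if not str(r[type_key]).startswith("T5"):
            continue
        answer = str(r[answer_key]).strip().lower()
        if answer.startswith("yes"):
            yes += 1
        elif answer.startswith("no"):
            no += 1
    total = yes + no
    return yes / total if total else None


if __name__ == "__main__":
    # The listing's own figures: 1,003 rows over 400 organisms.
    print(round(1003 / 400, 1))  # 2.5
```

With network access, `summarize(load_splits()["train"])` would give the per-split counts, provided the hypothetical column names are adjusted to match the real schema.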
Provider:
ceselder