
ceselder/loracle-pretrain-qa-v4-20k

Source: Hugging Face. Updated 2026-04-23; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ceselder/loracle-pretrain-qa-v4-20k
Description:
loracle-pretrain-qa-v4-20k is a large-scale QA dataset for pretraining and meta-modeling, containing roughly 45,000 to 50,000 Q/A rows. It is built from 1.65 million FineFineWeb documents, from which 21,000 topic-clustered organisms are generated using BGE-small-en-v1.5 embeddings and spherical k-means (K=10000). The organisms cover multiple question types (T1-T6/T0) and are written in a third-person register. 95% of the data is clean; the remaining 5% has toxic content injected to improve robustness. Each Q/A pair carries metadata including question type, source documents, languages, and toxicity flags. The data is split 90%/5%/5% into train, DPO-heldout, and test sets, with MinHash-LSH deduplication (threshold 0.85) applied to T1 questions.
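The deduplication step mentioned above can be illustrated with a minimal sketch. The description only states that MinHash-LSH with a 0.85 threshold is applied to T1 questions; everything else here is an assumption made for illustration: the 3-word shingles, the 128 seeded hash functions, and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`, `dedupe`). For brevity the sketch compares signatures pairwise rather than building the LSH banding index, so it shows the MinHash similarity estimate but not the sub-quadratic lookup a real pipeline would likely use.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text (k is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_perm=128):
    """Compute one minimum hash value per seeded hash function.

    Each seed acts as a stand-in for a random permutation of the shingle space.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def dedupe(questions, threshold=0.85, num_perm=128):
    """Keep a question only if it is below the threshold against every kept one.

    All-pairs comparison, not LSH: fine for a demo, too slow for large corpora.
    """
    kept, kept_sigs = [], []
    for question in questions:
        sig = minhash_signature(shingles(question), num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(question)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    # Hypothetical example questions, not taken from the dataset.
    sample = [
        "What does the model say about coral reef bleaching?",
        "what does the model say about coral reef bleaching",
        "How does the document describe medieval trade routes?",
    ]
    for question in dedupe(sample):
        print(question)
```

In this sketch the second sample question differs from the first only in case and punctuation, so both produce the same shingle set and the later one is dropped. A production pipeline would more plausibly use a library implementation with an LSH index so that each question is compared only against likely candidates.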
Provider:
ceselder