qvac/GenesisI

Source: Hugging Face. Updated: 2025-12-11. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/qvac/GenesisI
Description:
QVAC Genesis I is a large-scale education-focused synthetic dataset (40.9B tokens; 31.8M rows) purpose-built for LLM pre-training and reasoning-centric post-training. It covers Mathematics, Physics, Biology, Medicine, and Logical Deduction across high-school and college/professional levels. Data are generated via a scalable learn-from-failures pipeline: seed → MCQs → model answering → LLM-as-a-Judge extraction → failure-analysis educational content in four styles (Textbook, Q&A, Web Article, Dialogue). Compared to prior open synthetic corpora (e.g., Cosmopedia), Genesis I emphasizes curriculum alignment, balanced domain coverage, and pedagogically rich explanations targeted at the places models actually fail.
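The learn-from-failures flow described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' implementation: the names `MCQ`, `judge_extract`, `learn_from_failures`, `answer_fn`, and `write_fn` are hypothetical, the regex-based judge is a stand-in for a real LLM-as-a-Judge call, and the sketch assumes that only incorrectly answered questions move forward and that each failure is written up in all four styles.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Optional

# The four output styles named in the dataset description.
STYLES = ("Textbook", "Q&A", "Web Article", "Dialogue")


@dataclass
class MCQ:
    """Hypothetical container for one multiple-choice question built from a seed."""
    question: str
    options: Dict[str, str]  # option label -> option text
    answer: str              # label of the correct option


def judge_extract(raw_response: str, options: Dict[str, str]) -> Optional[str]:
    """Stand-in for the LLM-as-a-Judge step: find the option label the model chose.

    A real judge would be another LLM call; this placeholder returns the first
    standalone capital letter in the response that is a valid option label.
    """
    for token in re.findall(r"\b[A-Z]\b", raw_response):
        if token in options:
            return token
    return None


def learn_from_failures(
    mcqs: Iterable[MCQ],
    answer_fn: Callable[[MCQ], str],
    write_fn: Callable[[MCQ, Optional[str], str], dict],
) -> Iterator[dict]:
    """Yield educational documents only for questions the model got wrong."""
    for mcq in mcqs:
        raw = answer_fn(mcq)                      # model answering
        chosen = judge_extract(raw, mcq.options)  # judge extraction
        if chosen == mcq.answer:
            continue                              # assumption: correct answers are skipped
        for style in STYLES:                      # assumption: every style per failure
            yield write_fn(mcq, chosen, style)


def toy_answer(mcq: MCQ) -> str:
    """Toy model that always picks option A."""
    return "The answer is A"


def toy_writer(mcq: MCQ, chosen: Optional[str], style: str) -> dict:
    """Toy writer that records what a failure-analysis document would cover."""
    return {
        "style": style,
        "question": mcq.question,
        "model_choice": chosen,
        "correct_choice": mcq.answer,
    }


if __name__ == "__main__":
    questions = [
        MCQ("What is 2 + 2?", {"A": "3", "B": "4"}, "B"),  # toy model fails this one
        MCQ("What is 3 * 3?", {"A": "9", "B": "6"}, "A"),  # toy model gets this right
    ]
    for doc in learn_from_failures(questions, toy_answer, toy_writer):
        print(doc["style"], "|", doc["question"])
```

In a real pipeline, `answer_fn` and `write_fn` would wrap LLM calls, and the writer would produce the full failure-analysis text rather than a small record.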
Provider:
qvac