codelion/synth-10M

Name: codelion/synth-10M
Creator: codelion
Published: 2025-11-11 04:04:11
License: not specified

Source: Hugging Face. Updated 2025-11-11. Indexed 2025-11-15.

Download link:

https://hf-mirror.com/datasets/codelion/synth-10M

Description:

The PleIAs/SYNTH Sampled Subset contains about 14.6 million tokens (14,631,489), drawn from the original PleIAs/SYNTH dataset by reservoir sampling. Each sample combines four fields: the query, the query seed text, the synthetic reasoning, and the synthetic answer, so every record carries its full context, reasoning, and answer. The dataset covers a range of exercise types, including memorization, multiple-choice questions, creative writing, and math exercises. It is multilingual: roughly 80% of the content is English, and the remainder is Spanish, German, French, Polish, Italian, Dutch, Latin, and other languages. Suggested uses are small-scale pretraining of reasoning models, synthetic data experiments, dataset composition studies, rapid prototyping and testing, low-cost training runs, and multilingual model development.
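The description says the subset was drawn by reservoir sampling. The following is a minimal sketch of that technique (the classic "Algorithm R"), which picks k items uniformly at random from a stream of unknown length in a single pass. It is an illustration only, not the script used to build this dataset; the function name `reservoir_sample` and its `k` and `seed` parameters are assumptions made for the example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from stream.

    Hypothetical helper for illustration; uses one pass and O(k) memory.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Draw 10 rows from a stand-in "dataset" of 1,000 row indices.
    print(reservoir_sample(range(1000), k=10, seed=42))
```

Because the reservoir never holds more than k items, the same approach works on a source too large to load into memory, which is presumably why it suits subsampling a large corpus.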

Provider:

codelion
