codelion/synth-10M
Source: Hugging Face. Updated: 2025-11-11. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/codelion/synth-10M
Description:
The PleIAs/SYNTH Sampled Subset is a dataset of 14,631,489 tokens (about 14.6 million), drawn from the original PleIAs/SYNTH dataset by reservoir sampling. Each sample combines four fields: query, query seed text, synthetic reasoning, and synthetic answer, so that every record carries its full context, reasoning, and answer. The dataset covers several exercise types, including memorization, multiple-choice questions, creative writing, and math exercises. It is multilingual: approximately 80% of the content is English, and the rest is in Spanish, German, French, Polish, Italian, Dutch, Latin, and other languages. Suggested uses are small-scale pretraining of reasoning models, synthetic data experiments, dataset composition studies, rapid prototyping and testing, low-cost training runs, and multilingual model development.
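The description names reservoir sampling but does not say which variant, sample size, or random seed was used. As a rough illustration only, the sketch below implements the classic single-pass form (often called Algorithm R), which keeps a uniform random sample of `k` items from a stream of unknown length. The function name `reservoir_sample` and its `k` and `seed` parameters are assumptions made for this example, not part of the dataset's tooling.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from stream.

    Hypothetical helper for illustration; the actual sampling script
    used to build the subset is not described in the source.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 10 record indices from a stream of 1,000 without storing the stream.
sample = reservoir_sample(range(1000), k=10, seed=1)
```

Because the stream is read once and only `k` items are held in memory, this approach suits sampling from a source dataset that is too large to load whole. If the stream has fewer than `k` items, all of them are returned.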
Provided by:
codelion



