codelion/synth-100M
Source: Hugging Face. Updated 2025-11-11. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/codelion/synth-100M
Description:
The PleIAs/SYNTH Sampled Dataset is a subset of roughly 109 million tokens (109,149,965), drawn from the PleIAs/SYNTH dataset by reservoir sampling. Each sample has four fields: query (the question or prompt), query_seed_text (Wikipedia or other reference context), synthetic_reasoning (a step-by-step reasoning trace), and synthetic_answer (the final answer). The data is multilingual, covering English, Spanish, German, French, Polish, Italian, Dutch, Latin, and others, with English making up about 80%. Suggested uses include pretraining small reasoning models and experiments with synthetic data.
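The description names reservoir sampling but does not say how it was applied. As a rough illustration only, the sketch below shows the classic fixed-size variant (Algorithm R), which draws a uniform sample of k items from a stream in a single pass without knowing the stream length in advance. The function name `reservoir_sample`, the `seed` parameter, and the fixed-k formulation are assumptions made for this example; the dataset's actual sampling script may differ, for instance by sampling up to a token budget rather than a row count.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from stream (Algorithm R).

    Hypothetical helper for illustration; not the dataset author's code.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Toy stream standing in for dataset rows.
    rows = ({"query": f"question {n}"} for n in range(10_000))
    sample = reservoir_sample(rows, k=5, seed=42)
    print(len(sample))
```

Because the reservoir never holds more than k items, memory use stays constant no matter how large the source is, which is why the technique suits subsampling a large corpus that is read as a stream.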
Provider:
codelion



