codelion/synth-100M

Name: codelion/synth-100M
Creator: codelion
Published: 2025-11-11 04:17:04
License: No description available

Source: Hugging Face. Updated: 2025-11-11. Indexed: 2025-11-15.

Download link:

https://hf-mirror.com/datasets/codelion/synth-100M

Description:

This dataset is a subset of PleIAs/SYNTH containing approximately 109,149,965 tokens, drawn from the parent dataset by reservoir sampling. Each sample has four fields: query (the question or prompt), query_seed_text (Wikipedia or reference context), synthetic_reasoning (a step-by-step reasoning trace), and synthetic_answer (the final answer). It covers multiple languages, including English, Spanish, German, French, Polish, Italian, Dutch, and Latin, with English accounting for about 80%. It is suited to small-scale reasoning model pretraining and synthetic data experiments, among other uses.
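For readers unfamiliar with reservoir sampling, the following is a minimal sketch of the classic fixed-size variant (often called Algorithm R). It is an illustration only: the listing does not say which variant or parameters were used to build this subset, and the function name `reservoir_sample`, the sample size `k`, and the `seed` argument are assumptions introduced here.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    Hypothetical helper for illustration; not the dataset author's code.
    The stream is read once and only k items are held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 record indices from a stream of 1,000 without materializing it.
    print(reservoir_sample(range(1000), k=5, seed=42))
```

The appeal of this approach for a large parent corpus is that it needs a single pass and constant memory, so the full dataset never has to be loaded or counted in advance.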

Provider:

codelion
