synth-10M

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/codelion/synth-10M
Resource description:
# PleIAs/SYNTH Sampled Dataset (10,000,000 tokens)

This is a sampled subset of [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH) containing approximately **14,631,489 tokens**.

## Dataset Details

### Source

- **Original Dataset**: PleIAs/SYNTH (~87B tokens, 79.6M samples)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 10,000,000 tokens
- **Actual Token Count**: 14,631,489 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)

### Sampling Statistics

- **Documents Sampled**: 13,345
- **Documents Processed**: 13,345
- **Tokens Processed**: 14,631,489
- **Sampling Rate**: 1.0000
- **Random Seed**: 42

### Text Field Combination

Each sample combines four fields from the original SYNTH dataset:

1. **query**: The question or prompt
2. **query_seed_text**: Wikipedia or reference context
3. **synthetic_reasoning**: Step-by-step reasoning trace
4. **synthetic_answer**: Final answer

This creates comprehensive training examples with full context, reasoning, and answers.

### Sampling Method

This dataset was created using **reservoir sampling**, which ensures:

- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Efficient processing of 500 parquet files

The sampling algorithm:

1. Streams through all 500 PleIAs/SYNTH parquet files
2. Combines four text fields into comprehensive training examples
3. Uses GPT-2 tokenizer to count tokens per document
4. Maintains a reservoir of documents until target token count
5. For each new document, replaces reservoir items with probability k/n
   - k = reservoir size, n = total documents seen
6. Guarantees uniform random sample across entire dataset

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/synth-10M")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(f"Language: {example['language']}")
    print(f"Exercise type: {example['exercise']}")
```

## Dataset Structure

Each example contains:

- `text`: Combined text (query + context + reasoning + answer)
- `synth_id`: Original SYNTH dataset ID
- `language`: Language code (en, es, de, fr, pl, it, nl, la, etc.)
- `exercise`: Type of exercise (memorization, mcq, creative writing, math, rag, etc.)

## Exercise Types

The dataset includes diverse synthetic tasks:

- **Memorization**: Question-answering with Wikipedia context
- **MCQ**: Multiple choice questions
- **Creative Writing**: Poetry, stories, creative prompts
- **Math Exercise**: Word problems with step-by-step solutions
- **RAG**: Retrieval-augmented generation tasks
- **Constrained Writing**: Writing with specific constraints
- **Editing**: Text editing and improvement tasks

## Languages

Approximately 80% English with multilingual content in:

- Spanish (es)
- German (de)
- French (fr)
- Polish (pl)
- Italian (it)
- Dutch (nl)
- Latin (la)
- And more

## Use Cases

This sampled dataset is ideal for:

- 🧠 Small-scale reasoning model pretraining
- 🔬 Synthetic data experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
- 🌍 Multilingual model development

## Citation

If you use this dataset, please cite both the original SYNTH dataset and mention the sampling methodology:

```bibtex
@dataset{synth_sampled_10000000,
  title={PleIAs/SYNTH Sampled Dataset (10,000,000 tokens)},
  author={CodeLion},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/codelion/synth-10M}},
  note={Sampled from PleIAs/SYNTH using reservoir sampling}
}

@dataset{synth_original,
  title={SYNTH: The First Open Generalist Synthetic Dataset},
  author={PleIAs},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/PleIAs/SYNTH}}
}
```

## License

Apache 2.0 (same as original SYNTH dataset)

## Dataset Card Authors

CodeLion

## Dataset Card Contact

For questions or issues, please open an issue on the dataset repository.
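
The reservoir sampling step described under "Sampling Method" can be illustrated with a minimal sketch. This is not the script used to build the dataset: it assumes a fixed reservoir size `k` counted in documents, whereas the card describes a reservoir bounded by a token target. The function name `reservoir_sample` and the toy `docs` list are hypothetical; only the seed value 42 and the k/n replacement rule come from the card.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # seed 42 mirrors the card; any seed works
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: the first k items always enter the reservoir.
            reservoir.append(item)
        else:
            # Replacement phase: item n is kept with probability k/n,
            # overwriting a uniformly chosen slot.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy stream standing in for documents read from parquet files (assumption).
docs = [f"doc-{i}" for i in range(1000)]
sample = reservoir_sample(docs, k=10)
print(len(sample))
```

Because the stream is consumed one item at a time, memory use depends on `k` only, which is why this approach suits a corpus spread across many files.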

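
The "Text Field Combination" section says each `text` value merges four SYNTH fields. A rough sketch of such a merge follows. The four field names are taken from the card, but the separator, the handling of empty fields, the helper name `combine_fields`, and the sample record are all assumptions made for illustration; the card does not state how the fields are actually joined.

```python
from typing import Mapping

# Field names and their order come from the dataset card.
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")


def combine_fields(record: Mapping[str, str], sep: str = "\n\n") -> str:
    """Join the four fields in card order, skipping empty or missing ones.

    The blank-line separator is an assumption, not the dataset's documented format.
    """
    parts = [(record.get(name) or "").strip() for name in FIELDS]
    return sep.join(p for p in parts if p)


# Hypothetical record, for illustration only.
record = {
    "query": "What is 2 + 3?",
    "query_seed_text": "",
    "synthetic_reasoning": "Add the two numbers: 2 + 3 = 5.",
    "synthetic_answer": "5",
}
text = combine_fields(record)
print(text)
```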
Provider: maas
Created: 2025-11-11