synth-100M

Name: synth-100M
Creator: maas
Published: 2025-12-05 16:56:20
License: No description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/codelion/synth-100M


Resource description:

# PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)

This is a sampled subset of [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH) containing approximately **109,149,965 tokens**.

## Dataset Details

### Source

- **Original Dataset**: PleIAs/SYNTH (~87B tokens, 79.6M samples)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 100,000,000 tokens
- **Actual Token Count**: 109,149,965 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)

### Sampling Statistics

- **Documents Sampled**: 100,000
- **Documents Processed**: 100,000
- **Tokens Processed**: 109,149,965
- **Sampling Rate**: 1.0000
- **Random Seed**: 42

### Text Field Combination

Each sample combines four fields from the original SYNTH dataset:

1. **query**: The question or prompt
2. **query_seed_text**: Wikipedia or reference context
3. **synthetic_reasoning**: Step-by-step reasoning trace
4. **synthetic_answer**: Final answer

This creates comprehensive training examples with full context, reasoning, and answers.

### Sampling Method

This dataset was created using **reservoir sampling**, which ensures:

- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Efficient processing of 500 parquet files

The sampling algorithm:

1. Streams through all 500 PleIAs/SYNTH parquet files
2. Combines four text fields into comprehensive training examples
3. Uses GPT-2 tokenizer to count tokens per document
4. Maintains a reservoir of documents until target token count
5. For each new document, replaces reservoir items with probability k/n
   - k = reservoir size, n = total documents seen
6. Guarantees uniform random sample across entire dataset

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/synth-100M")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(f"Language: {example['language']}")
    print(f"Exercise type: {example['exercise']}")
```

## Dataset Structure

Each example contains:

- `text`: Combined text (query + context + reasoning + answer)
- `synth_id`: Original SYNTH dataset ID
- `language`: Language code (en, es, de, fr, pl, it, nl, la, etc.)
- `exercise`: Type of exercise (memorization, mcq, creative writing, math, rag, etc.)

## Exercise Types

The dataset includes diverse synthetic tasks:

- **Memorization**: Question-answering with Wikipedia context
- **MCQ**: Multiple choice questions
- **Creative Writing**: Poetry, stories, creative prompts
- **Math Exercise**: Word problems with step-by-step solutions
- **RAG**: Retrieval-augmented generation tasks
- **Constrained Writing**: Writing with specific constraints
- **Editing**: Text editing and improvement tasks

## Languages

Approximately 80% English with multilingual content in:

- Spanish (es)
- German (de)
- French (fr)
- Polish (pl)
- Italian (it)
- Dutch (nl)
- Latin (la)
- And more

## Use Cases

This sampled dataset is ideal for:

- 🧠 Small-scale reasoning model pretraining
- 🔬 Synthetic data experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
- 🌍 Multilingual model development

## Citation

If you use this dataset, please cite both the original SYNTH dataset and mention the sampling methodology:

```bibtex
@dataset{synth_sampled_100000000,
  title={PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)},
  author={CodeLion},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/codelion/synth-100M}},
  note={Sampled from PleIAs/SYNTH using reservoir sampling}
}

@dataset{synth_original,
  title={SYNTH: The First Open Generalist Synthetic Dataset},
  author={PleIAs},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/PleIAs/SYNTH}}
}
```

## License

Apache 2.0 (same as original SYNTH dataset)

## Dataset Card Authors

CodeLion

## Dataset Card Contact

For questions or issues, please open an issue on the dataset repository.
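
For readers who want to see the mechanics behind the field combination and sampling steps described in this card, here is a minimal sketch. It is not the script that produced this dataset. It assumes a fixed reservoir size `k` (classic Algorithm R) rather than a token-count target, omits GPT-2 token counting, and joins the four fields with a blank line; the names `combine_fields` and `reservoir_sample` and the separator are hypothetical, chosen only for illustration.

```python
import random
from typing import Dict, Iterable, List, TypeVar

T = TypeVar("T")

# The four SYNTH fields named in the card, in the order they are listed.
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")


def combine_fields(row: Dict[str, str], sep: str = "\n\n") -> str:
    """Join the four fields into one text, skipping empty or missing ones.

    The separator is an assumption; the card does not state how fields are joined.
    """
    return sep.join(row[f] for f in FIELDS if row.get(f))


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return a uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for n, item in enumerate(items, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw j uniformly from [0, n); j < k happens with probability k/n,
            # in which case the new item replaces slot j.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Tiny demonstration on made-up rows (not real SYNTH data).
rows = (
    {
        "query": f"question {i}",
        "query_seed_text": "",
        "synthetic_reasoning": f"reasoning {i}",
        "synthetic_answer": f"answer {i}",
    }
    for i in range(1000)
)
sample = reservoir_sample((combine_fields(r) for r in rows), k=10)
print(len(sample))
```

Because the stream is consumed one item at a time, memory use depends only on `k`, which is what makes this approach suitable for streaming through many parquet files.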


Provider: maas

Created: 2025-11-11
