
synth-100M

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/codelion/synth-100M
Resource description:
# PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)

This is a sampled subset of [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH) containing **109,149,965 tokens** (target: 100,000,000).

## Dataset Details

### Source

- **Original Dataset**: PleIAs/SYNTH (~87B tokens, 79.6M samples)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 100,000,000 tokens
- **Actual Token Count**: 109,149,965 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)

### Sampling Statistics

- **Documents Sampled**: 100,000
- **Documents Processed**: 100,000
- **Tokens Processed**: 109,149,965
- **Sampling Rate**: 1.0000
- **Random Seed**: 42

### Text Field Combination

Each sample combines four fields from the original SYNTH dataset:

1. **query**: The question or prompt
2. **query_seed_text**: Wikipedia or reference context
3. **synthetic_reasoning**: Step-by-step reasoning trace
4. **synthetic_answer**: Final answer

This creates comprehensive training examples with full context, reasoning, and answers.

### Sampling Method

This dataset was created using **reservoir sampling**, which ensures:

- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Efficient processing of 500 parquet files

The sampling algorithm:

1. Streams through all 500 PleIAs/SYNTH parquet files
2. Combines four text fields into comprehensive training examples
3. Uses the GPT-2 tokenizer to count tokens per document
4. Maintains a reservoir of documents until the target token count is reached
5. For each new document, replaces a reservoir item with probability k/n
   - k = reservoir size, n = total documents seen so far
6. Guarantees a uniform random sample across the entire dataset

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("codelion/synth-100M")

# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(f"Language: {example['language']}")
    print(f"Exercise type: {example['exercise']}")
```

## Dataset Structure

Each example contains:

- `text`: Combined text (query + context + reasoning + answer)
- `synth_id`: Original SYNTH dataset ID
- `language`: Language code (en, es, de, fr, pl, it, nl, la, etc.)
- `exercise`: Type of exercise (memorization, mcq, creative writing, math, rag, etc.)

## Exercise Types

The dataset includes diverse synthetic tasks:

- **Memorization**: Question-answering with Wikipedia context
- **MCQ**: Multiple choice questions
- **Creative Writing**: Poetry, stories, creative prompts
- **Math Exercise**: Word problems with step-by-step solutions
- **RAG**: Retrieval-augmented generation tasks
- **Constrained Writing**: Writing with specific constraints
- **Editing**: Text editing and improvement tasks

## Languages

Approximately 80% English, with multilingual content in:

- Spanish (es)
- German (de)
- French (fr)
- Polish (pl)
- Italian (it)
- Dutch (nl)
- Latin (la)
- And more

## Use Cases

This sampled dataset is suited to:

- 🧠 Small-scale reasoning model pretraining
- 🔬 Synthetic data experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
- 🌍 Multilingual model development

## Citation

If you use this dataset, please cite the original SYNTH dataset and mention the sampling methodology:

```bibtex
@dataset{synth_sampled_100000000,
  title={PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)},
  author={CodeLion},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/codelion/synth-100M}},
  note={Sampled from PleIAs/SYNTH using reservoir sampling}
}

@dataset{synth_original,
  title={SYNTH: The First Open Generalist Synthetic Dataset},
  author={PleIAs},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/PleIAs/SYNTH}}
}
```

## License

Apache 2.0 (same as original SYNTH dataset)

## Dataset Card Authors

CodeLion

## Dataset Card Contact

For questions or issues, please open an issue on the dataset repository.
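
## Example: Reservoir Sampling Sketch

The sampling steps described in this card follow the general shape of reservoir sampling. The sketch below is a minimal, textbook fixed-size version (often called Algorithm R), not the script that produced this dataset. The function name `reservoir_sample` and its parameters are hypothetical, and the card does not say how the token target maps to a reservoir size, so a fixed `k` is assumed here.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration only. A fixed reservoir size k is
    an assumption; the card describes a token-count target instead.
    """
    rng = random.Random(seed)  # seeded for reproducibility (the card lists seed 42)
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw a slot uniformly from [0, n). The new item is kept only if
            # the slot falls inside the reservoir, which happens with
            # probability k/n, and it then replaces the item in that slot.
            slot = rng.randrange(n)
            if slot < k:
                reservoir[slot] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 document indices out of 1,000 in a single pass.
    print(reservoir_sample(range(1000), k=10))
```

Because the stream is consumed once and only `k` items are held in memory, this pattern suits corpora split across many files, where loading everything before sampling would be impractical.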

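
## Example: Combining the Four Text Fields

The card states that each `text` value combines `query`, `query_seed_text`, `synthetic_reasoning`, and `synthetic_answer`, in that order, but it does not give the exact separator or how missing fields are handled. The sketch below is one plausible way to do it. The helper name `combine_fields`, the blank-line separator, and the choice to skip empty fields are all assumptions made for illustration.

```python
from typing import Mapping

# Field order as listed in the card.
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")


def combine_fields(row: Mapping[str, object], separator: str = "\n\n") -> str:
    """Join the four SYNTH fields into a single training text.

    Hypothetical helper: the separator and the skipping of empty or
    missing fields are assumptions, not documented behavior.
    """
    parts = []
    for name in FIELDS:
        value = row.get(name)
        # Keep only non-empty string values.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return separator.join(parts)


if __name__ == "__main__":
    row = {
        "query": "What is 2 + 2?",
        "query_seed_text": "",
        "synthetic_reasoning": "Adding 2 and 2 gives 4.",
        "synthetic_answer": "4",
    }
    print(combine_fields(row))
```

To reproduce the exact `text` values in this dataset, the separator would need to be checked against actual samples.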
Provider: maas
Created: 2025-11-11