synth-10M
Source: ModelScope community (updated 2025-12-05, indexed 2025-12-06)
Download link: https://modelscope.cn/datasets/codelion/synth-10M
Description:
# PleIAs/SYNTH Sampled Dataset (10,000,000 tokens)
This is a sampled subset of [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH) containing **14,631,489 tokens** as counted with the GPT-2 tokenizer, which is above the 10,000,000-token target named in the title.
## Dataset Details
### Source
- **Original Dataset**: PleIAs/SYNTH (~87B tokens, 79.6M samples)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 10,000,000 tokens
- **Actual Token Count**: 14,631,489 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)
### Sampling Statistics
- **Documents Sampled**: 13,345
- **Documents Processed**: 13,345
- **Tokens Processed**: 14,631,489
- **Sampling Rate**: 1.0000 (documents sampled / documents processed)
- **Random Seed**: 42
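As a quick sanity check on the figures above, the following sketch derives two numbers the card does not state directly: the average document length and how far the actual token count overshoots the target. The variable names are illustrative; the inputs are copied from the lists above.

```python
# Figures copied from the dataset card
target_tokens = 10_000_000
actual_tokens = 14_631_489
documents = 13_345

# Average number of GPT-2 tokens per sampled document
avg_tokens_per_doc = actual_tokens / documents

# Relative overshoot of the actual token count over the target
overshoot = actual_tokens / target_tokens - 1

print(f"{avg_tokens_per_doc:.1f} tokens per document on average")  # 1096.4
print(f"{overshoot:.1%} above the target token count")             # 46.3%
```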
### Text Field Combination
Each sample combines four fields from the original SYNTH dataset:
1. **query**: The question or prompt
2. **query_seed_text**: Wikipedia or reference context
3. **synthetic_reasoning**: Step-by-step reasoning trace
4. **synthetic_answer**: Final answer
This creates comprehensive training examples with full context, reasoning, and answers.
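The card does not say how the four fields are joined, so the following is only a minimal sketch of one plausible combination. The field names come from the list above; the blank-line separator, the field order, the skipping of empty fields, and the helper name `combine_fields` are all assumptions, and the sample row is made up for illustration.

```python
# Field names are from the dataset card; order and separator are assumptions.
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")

def combine_fields(row: dict, sep: str = "\n\n") -> str:
    """Join the four SYNTH fields into one text, skipping missing or empty ones."""
    parts = []
    for name in FIELDS:
        value = row.get(name)
        if value:  # skips None and empty strings
            parts.append(str(value).strip())
    return sep.join(parts)

# Made-up row, for illustration only
row = {
    "query": "What is the capital of France?",
    "query_seed_text": "Paris is the capital and largest city of France.",
    "synthetic_reasoning": "The context states that Paris is the capital.",
    "synthetic_answer": "Paris.",
}
text = combine_fields(row)
print(text)
```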
### Sampling Method
This dataset was created using **reservoir sampling**, which ensures:
- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Efficient processing of 500 parquet files
The sampling algorithm:
1. Streams through all 500 PleIAs/SYNTH Parquet files
2. Combines the four text fields into a single training example per document
3. Counts tokens per document with the GPT-2 tokenizer
4. Fills a reservoir with documents until the target token count is reached
5. For each subsequent document, replaces a randomly chosen reservoir item with probability k/n
   - k = reservoir size, n = total documents seen
6. Produces a uniform random sample across the entire dataset
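The replacement rule in step 5 matches classic reservoir sampling (Algorithm R). The sketch below shows that textbook version with a fixed reservoir of `k` documents. It is not the script used to build this dataset: the card sizes its reservoir by a token budget rather than a document count, and the function name `reservoir_sample` is an assumption. The default seed mirrors the card's random seed of 42.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: keep the first k items unconditionally
            reservoir.append(item)
        else:
            # Draw a slot in [0, n); it lands inside the reservoir with probability k/n
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 10 items from a stream of 1000 integers
sample = reservoir_sample(range(1000), k=10)
print(sample)
```

Because each item only needs to be seen once, the stream can be as large as the full set of Parquet files without holding more than `k` documents in memory.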
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("codelion/synth-10M")
# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(f"Language: {example['language']}")
    print(f"Exercise type: {example['exercise']}")
```
## Dataset Structure
Each example contains:
- `text`: Combined text (query + context + reasoning + answer)
- `synth_id`: Original SYNTH dataset ID
- `language`: Language code (en, es, de, fr, pl, it, nl, la, etc.)
- `exercise`: Type of exercise (memorization, mcq, creative writing, math, rag, etc.)
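Given these fields, a common first step is to count examples per language and exercise type, or to select a subset. The sketch below works on any iterable of dicts with `language` and `exercise` keys, so it should also accept rows from `dataset['train']`. The helper names `summarize` and `select` are hypothetical, the records are made up, and the exact label strings stored in `exercise` are an assumption based on the list above.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

def summarize(examples: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Count examples per language and per exercise type in a single pass."""
    by_language: Counter = Counter()
    by_exercise: Counter = Counter()
    for ex in examples:
        by_language[ex["language"]] += 1
        by_exercise[ex["exercise"]] += 1
    return by_language, by_exercise

def select(examples: Iterable[dict],
           language: Optional[str] = None,
           exercise: Optional[str] = None) -> List[dict]:
    """Keep examples that match the given language and/or exercise type."""
    return [
        ex for ex in examples
        if (language is None or ex["language"] == language)
        and (exercise is None or ex["exercise"] == exercise)
    ]

# Made-up records, for illustration only (IDs and labels are placeholders)
records = [
    {"text": "sample one", "synth_id": "id-1", "language": "en", "exercise": "mcq"},
    {"text": "sample two", "synth_id": "id-2", "language": "en", "exercise": "rag"},
    {"text": "sample three", "synth_id": "id-3", "language": "fr", "exercise": "mcq"},
]

by_language, by_exercise = summarize(records)
english_mcq = select(records, language="en", exercise="mcq")
print(by_language, by_exercise, len(english_mcq))
```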
## Exercise Types
The dataset includes diverse synthetic tasks:
- **Memorization**: Question-answering with Wikipedia context
- **MCQ**: Multiple choice questions
- **Creative Writing**: Poetry, stories, creative prompts
- **Math Exercise**: Word problems with step-by-step solutions
- **RAG**: Retrieval-augmented generation tasks
- **Constrained Writing**: Writing with specific constraints
- **Editing**: Text editing and improvement tasks
## Languages
Approximately 80% English with multilingual content in:
- Spanish (es)
- German (de)
- French (fr)
- Polish (pl)
- Italian (it)
- Dutch (nl)
- Latin (la)
- And more
## Use Cases
This sampled dataset is ideal for:
- 🧠 Small-scale reasoning model pretraining
- 🔬 Synthetic data experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
- 🌍 Multilingual model development
## Citation
If you use this dataset, please cite both the original SYNTH dataset and mention the sampling methodology:
```bibtex
@dataset{synth_sampled_10000000,
title={PleIAs/SYNTH Sampled Dataset (10,000,000 tokens)},
author={CodeLion},
year={2025},
howpublished={\url{https://huggingface.co/datasets/codelion/synth-10M}},
note={Sampled from PleIAs/SYNTH using reservoir sampling}
}
@dataset{synth_original,
title={SYNTH: The First Open Generalist Synthetic Dataset},
author={PleIAs},
year={2025},
howpublished={\url{https://huggingface.co/datasets/PleIAs/SYNTH}}
}
```
## License
Apache 2.0 (same as original SYNTH dataset)
## Dataset Card Authors
CodeLion
## Dataset Card Contact
For questions or issues, please open an issue on the dataset repository.
Provider:
maas
Created:
2025-11-11



