synth-100M
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/codelion/synth-100M
# PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)
This is a sampled subset of [PleIAs/SYNTH](https://huggingface.co/datasets/PleIAs/SYNTH) containing **109,149,965 tokens** (counted with the GPT-2 tokenizer), against a target of 100,000,000.
## Dataset Details
### Source
- **Original Dataset**: PleIAs/SYNTH (~87B tokens, 79.6M samples)
- **Sampling Method**: Reservoir sampling (unbiased random sampling)
- **Target Token Count**: 100,000,000 tokens
- **Actual Token Count**: 109,149,965 tokens
- **Tokenizer**: GPT-2 (50,257 vocabulary)
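Token counts depend on the tokenizer, so the figures above only hold under GPT-2 tokenization. The sketch below shows one way a per-document token count could be accumulated. The helper names `count_tokens` and `whitespace_encode` are illustrative and not part of the dataset's tooling, and the whitespace encoder is a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Return the total number of tokens across `texts` using the given encoder."""
    return sum(len(encode(text)) for text in texts)


def whitespace_encode(text: str) -> List[int]:
    # Stand-in encoder for illustration only: one "token" per whitespace-separated word.
    # A real run would pass the GPT-2 tokenizer instead (assumes `transformers` is installed):
    #   from transformers import AutoTokenizer
    #   encode = AutoTokenizer.from_pretrained("gpt2").encode
    return list(range(len(text.split())))


total = count_tokens(["a short query", "a slightly longer answer text"], whitespace_encode)
print(total)
```

Passing the encoder as a parameter keeps the counting logic independent of any particular tokenizer library.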
### Sampling Statistics
- **Documents Sampled**: 100,000
- **Documents Processed**: 100,000
- **Tokens Processed**: 109,149,965
- **Sampling Rate**: 1.0000
- **Random Seed**: 42
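Two derived quantities follow directly from the reported figures: the mean document length and the overshoot relative to the target. The arithmetic below uses only numbers stated in this card.

```python
# Figures reported in this dataset card.
target_tokens = 100_000_000
actual_tokens = 109_149_965
documents = 100_000

# Average number of GPT-2 tokens per sampled document.
mean_tokens_per_doc = actual_tokens / documents

# Fraction by which the actual count exceeds the target.
overshoot = (actual_tokens - target_tokens) / target_tokens

print(f"Mean tokens per document: {mean_tokens_per_doc:.1f}")
print(f"Overshoot vs. target: {overshoot:.2%}")
```

This works out to roughly 1,091.5 tokens per document and about 9.15% more tokens than the target.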
### Text Field Combination
Each sample combines four fields from the original SYNTH dataset:
1. **query**: The question or prompt
2. **query_seed_text**: Wikipedia or reference context
3. **synthetic_reasoning**: Step-by-step reasoning trace
4. **synthetic_answer**: Final answer
This creates comprehensive training examples with full context, reasoning, and answers.
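The card does not state how the four fields are joined. As a rough sketch, assuming they are concatenated in the listed order with a blank line between them and that empty fields are skipped, the combination could look like this (`combine_fields` and `separator` are hypothetical names):

```python
FIELDS = ("query", "query_seed_text", "synthetic_reasoning", "synthetic_answer")


def combine_fields(record: dict, separator: str = "\n\n") -> str:
    """Join the four SYNTH fields in order, skipping missing or empty ones.

    The separator and the skipping of empty fields are assumptions; the
    dataset card does not specify either.
    """
    parts = [str(record[name]).strip() for name in FIELDS if record.get(name)]
    return separator.join(parts)


example = {
    "query": "What is 2 + 2?",
    "query_seed_text": "",
    "synthetic_reasoning": "Adding 2 and 2 gives 4.",
    "synthetic_answer": "4",
}
print(combine_fields(example))
```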
### Sampling Method
This dataset was created using **reservoir sampling**, which ensures:
- ✅ Unbiased random sample from the full dataset
- ✅ Every document has equal probability of being selected
- ✅ No distribution bias (early/late documents equally represented)
- ✅ Efficient processing of 500 parquet files
The sampling algorithm:
1. Streams through all 500 PleIAs/SYNTH parquet files
2. Combines the four text fields into comprehensive training examples
3. Uses the GPT-2 tokenizer to count tokens per document
4. Maintains a reservoir of documents until the target token count is reached
5. For each new document, replaces a reservoir item with probability k/n
   - k = reservoir size, n = total documents seen so far
6. Yields a uniform random sample across the entire dataset
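The replacement step can be illustrated with the classic fixed-size reservoir sampling scheme (often called Algorithm R). This is a minimal sketch and not the script used to build the dataset: it keeps a fixed number of items `k`, whereas the card describes a token-count target, and how the original code reconciles the two is not stated. The function name `reservoir_sample` is illustrative.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: keep the first k items unconditionally.
            reservoir.append(item)
        else:
            # Draw j uniformly from [0, n); j < k happens with probability k/n,
            # in which case the new item overwrites slot j.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000), k=10)
print(sample)
```

Because each item is examined once and only `k` items are held in memory, the same pattern applies to a stream of documents read file by file.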
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("codelion/synth-100M")
# Access the training data
for example in dataset['train']:
    print(example['text'])
    print(f"Language: {example['language']}")
    print(f"Exercise type: {example['exercise']}")
```
## Dataset Structure
Each example contains:
- `text`: Combined text (query + context + reasoning + answer)
- `synth_id`: Original SYNTH dataset ID
- `language`: Language code (en, es, de, fr, pl, it, nl, la, etc.)
- `exercise`: Type of exercise (memorization, mcq, creative writing, math, rag, etc.)
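Since every example carries `language` and `exercise` fields, subsets can be selected with a simple filter. The sketch below works on any iterable of dictionaries with this schema; the helper `select_examples` and the toy records are made up for illustration, and the exact label strings stored in the dataset (for example `"mcq"`) are an assumption.

```python
from typing import Dict, Iterable, List


def select_examples(
    examples: Iterable[Dict[str, str]], language: str, exercise: str
) -> List[Dict[str, str]]:
    """Return the examples whose `language` and `exercise` fields both match."""
    return [
        ex for ex in examples
        if ex["language"] == language and ex["exercise"] == exercise
    ]


# Toy records shaped like the documented schema; all values are invented.
toy_examples = [
    {"text": "first sample", "synth_id": "a", "language": "en", "exercise": "mcq"},
    {"text": "second sample", "synth_id": "b", "language": "fr", "exercise": "mcq"},
    {"text": "third sample", "synth_id": "c", "language": "en", "exercise": "mcq"},
    {"text": "fourth sample", "synth_id": "d", "language": "en", "exercise": "rag"},
]

selected = select_examples(toy_examples, language="en", exercise="mcq")
print([ex["synth_id"] for ex in selected])
```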
## Exercise Types
The dataset includes diverse synthetic tasks:
- **Memorization**: Question-answering with Wikipedia context
- **MCQ**: Multiple choice questions
- **Creative Writing**: Poetry, stories, creative prompts
- **Math Exercise**: Word problems with step-by-step solutions
- **RAG**: Retrieval-augmented generation tasks
- **Constrained Writing**: Writing with specific constraints
- **Editing**: Text editing and improvement tasks
## Languages
Approximately 80% English with multilingual content in:
- Spanish (es)
- German (de)
- French (fr)
- Polish (pl)
- Italian (it)
- Dutch (nl)
- Latin (la)
- And more
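The language mix can be checked empirically by counting the `language` field. The sketch below computes shares by document count; the card does not say whether the 80% figure refers to documents or tokens, so this is only one possible reading. `language_shares` is a hypothetical helper and the toy records are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def language_shares(examples: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Return the fraction of examples per language code (by document count)."""
    counts = Counter(ex["language"] for ex in examples)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Toy records for illustration; real use would iterate over the loaded dataset split.
toy_examples = [
    {"language": "en"},
    {"language": "en"},
    {"language": "en"},
    {"language": "en"},
    {"language": "fr"},
]

shares = language_shares(toy_examples)
print(shares)
```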
## Use Cases
This sampled dataset is ideal for:
- 🧠 Small-scale reasoning model pretraining
- 🔬 Synthetic data experiments
- 📊 Dataset composition studies
- ⚡ Quick prototyping and testing
- 💰 Low-cost training runs
- 🌍 Multilingual model development
## Citation
If you use this dataset, please cite the original SYNTH dataset and mention the sampling methodology:
```bibtex
@dataset{synth_sampled_100000000,
title={PleIAs/SYNTH Sampled Dataset (100,000,000 tokens)},
author={CodeLion},
year={2025},
howpublished={\url{https://huggingface.co/datasets/codelion/synth-100M}},
note={Sampled from PleIAs/SYNTH using reservoir sampling}
}
@dataset{synth_original,
title={SYNTH: The First Open Generalist Synthetic Dataset},
author={PleIAs},
year={2025},
howpublished={\url{https://huggingface.co/datasets/PleIAs/SYNTH}}
}
```
## License
Apache 2.0 (same as original SYNTH dataset)
## Dataset Card Authors
CodeLion
## Dataset Card Contact
For questions or issues, please open an issue on the dataset repository.
Provider: maas
Created: 2025-11-11



