Dodosoomro/simple-100m-pretrain-1b
Source: Hugging Face · Updated 2026-04-13 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Dodosoomro/simple-100m-pretrain-1b
Resource description:
---
language:
- en
license: other
task_categories:
- text-generation
pretty_name: Simple-100M Pretraining Dataset (1B Tokens)
size_categories:
- 1B<n<10B
tags:
- text
- pretraining
- small-llm
- packed-dataset
- uint16-format
- position-ids
- gpt2-tokenizer
- educational
- code
- mathematics
viewer: false # Disable dataset viewer for packed sequences (not human-readable)
---
# Simple-100M Pretraining Dataset (1B Tokens)
[Dataset page on Hugging Face](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)
A **training-optimized, packed pretraining dataset** for ~100M parameter language models. Built for reproducibility, minimal runtime overhead, and exact mixing ratios.
---
## 🎯 Purpose
This dataset was created to train **Simple-100M**, a decoder-only Transformer targeting:
- ✅ Beat GPT-2-70M perplexity with minimal complexity
- ✅ Reproducible artifacts with exact token accounting
- ✅ Zero runtime preprocessing (ready-to-train)
**Target Architecture**: 32 layers, 448 hidden, 7 heads, SwiGLU, RoPE, RMSNorm, tied embeddings (~97.8M params).
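
The ~97.8M figure can be sanity-checked with a rough parameter count. This is a minimal sketch, not the author's accounting: the feed-forward intermediate size (`d_ff=1152`) and the absence of bias terms are assumptions that the card does not state, chosen because they reproduce the quoted total. The vocabulary size of 50,257 is the standard GPT-2 tokenizer size.

```python
# Rough parameter count for the stated architecture.
# ASSUMPTIONS (not stated in the card): d_ff = 1152, no bias terms,
# RoPE contributes no learned parameters.

def count_params(n_layers=32, d_model=448, d_ff=1152, vocab_size=50_257):
    embedding = vocab_size * d_model      # tied with the output head, so counted once
    attention = 4 * d_model * d_model     # Q, K, V and output projections
    swiglu = 3 * d_model * d_ff           # gate, up and down projections
    norms = 2 * d_model                   # two RMSNorm weight vectors per layer
    per_layer = attention + swiglu + norms
    final_norm = d_model                  # RMSNorm before the output head
    return embedding + n_layers * per_layer + final_norm


total = count_params()
head_dim = 448 // 7                       # 64 dimensions per attention head
print(f"{total:,} parameters (~{total / 1e6:.1f}M), head_dim={head_dim}")
```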
---
## 📊 Dataset Composition
### Token Allocation (Exact Mixing Ratios)
| Source | Tokens | Ratio | Description |
|--------|--------|-------|-------------|
| Cosmopedia (`web_samples_v1`) | 300M | 30% | Educational content, tutorials, explanations |
| FineWeb-Edu (`score≥3`) | 300M | 30% | High-quality educational web text |
| Finewiki (`en`) | 200M | 20% | Clean English Wikipedia articles |
| OpenWebMath | 100M | 10% | Mathematical content, LaTeX, reasoning |
| Python Code (`smollm-corpus:python-edu`) | 80M | 8% | Deduplicated, high-quality Python code |
| TinyStories | 20M | 2% | Synthetic short stories for coherence |
| **Total** | **1,000M** | **100%** | |
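
The allocation in the table follows directly from the ratios and the nominal 1,000M-token budget. The sketch below recomputes it with integer arithmetic; the dictionary keys and the `token_budgets` helper are illustrative names, not part of the dataset's tooling.

```python
# Recompute per-source token budgets from the mixing ratios in the table.
TOTAL_TOKENS = 1_000_000_000  # nominal budget (1,000M)

MIX_PERCENT = {
    "Cosmopedia": 30,
    "FineWeb-Edu": 30,
    "Finewiki": 20,
    "OpenWebMath": 10,
    "Python Code": 8,
    "TinyStories": 2,
}


def token_budgets(total, mix_percent):
    """Return the token budget per source; percentages must sum to 100."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixing percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budgets = token_budgets(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budgets.items():
    print(f"{name:<12} {tokens // 1_000_000:>4}M")
```

Working in whole percentages keeps the budgets exact, which avoids the rounding drift that floating-point ratios could introduce.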
### Train/Validation Split
- **Training**: 966,797 sequences × 1,024 tokens = **990,000,128 tokens** (99%)
- **Validation**: 9,765 sequences × 1,024 tokens = **9,999,360 tokens** (1%)
- Split strategy: a stratified holdout is extracted **before** shuffling, so validation sequences cannot leak into the training set
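
The token counts above can be checked by multiplying sequence counts by the sequence length. The short check below also shows that the two splits together hold 999,999,488 tokens, 512 short of the nominal 1,000M, because 1,000,000,000 is not an exact multiple of 1,024. Variable names are illustrative.

```python
# Verify the split arithmetic quoted in the card.
SEQ_LEN = 1024
TRAIN_SEQUENCES = 966_797
VAL_SEQUENCES = 9_765

train_tokens = TRAIN_SEQUENCES * SEQ_LEN
val_tokens = VAL_SEQUENCES * SEQ_LEN
total_tokens = train_tokens + val_tokens
shortfall = 1_000_000_000 - total_tokens  # tokens that do not fill a whole sequence

print(f"train: {train_tokens:,} ({train_tokens / total_tokens:.2%})")
print(f"val:   {val_tokens:,} ({val_tokens / total_tokens:.2%})")
print(f"total: {total_tokens:,} (nominal budget minus {shortfall})")
```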
---
## 🗂️ Format & Schema
### File Format
- Apache Arrow (`.arrow`) with chunked storage for efficient streaming
- Native `uint16` dtype for token IDs: GPT-2 token IDs span 0–50,256, which fits within the `uint16` range of 0–65,535
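
To illustrate why `uint16` is sufficient and what it saves, the sketch below uses Python's standard `array` module (type code `"H"`, an unsigned short) as a stand-in for the Arrow column. It is not how the dataset itself was written. The comparison against 8-byte integers assumes the common default of `int64` token IDs in training code.

```python
from array import array

GPT2_MAX_TOKEN_ID = 50_256   # largest ID in the 50,257-token GPT-2 vocabulary
UINT16_MAX = 2**16 - 1       # 65,535
SEQ_LEN = 1024

# Every GPT-2 token ID is representable as an unsigned 16-bit integer.
fits = GPT2_MAX_TOKEN_ID <= UINT16_MAX

# One packed sequence stored as unsigned shorts (2 bytes each on common platforms).
packed = array("H", [GPT2_MAX_TOKEN_ID] * SEQ_LEN)
bytes_uint16 = packed.itemsize * len(packed)
bytes_int64 = 8 * SEQ_LEN    # same sequence stored as 64-bit integers

print(f"fits in uint16: {fits}")
print(f"bytes per sequence: uint16={bytes_uint16}, int64={bytes_int64}")
```

Frameworks whose embedding layers expect wider integer types will usually need the IDs cast after loading (for example to `int64`), so the saving applies to storage and I/O rather than to the tensors used in the forward pass.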
### Schema
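```python
from datasets import Features, Sequence, Value

# Each row is one packed sequence of 1,024 tokens.
features = Features({
    "input_ids": Sequence(Value("uint16"), length=1024),     # Token IDs
    "position_ids": Sequence(Value("uint16"), length=1024),  # Reset at sequence start
})
```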
Provider:
Dodosoomro



