
Dodosoomro/simple-100m-pretrain-1b

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Dodosoomro/simple-100m-pretrain-1b
Description:
---
language:
- en
license: other
task_categories:
- text-generation
pretty_name: Simple-100M Pretraining Dataset (1B Tokens)
size_categories:
- 1B<n<10B
tags:
- text
- pretraining
- small-llm
- packed-dataset
- uint16-format
- position-ids
- gpt2-tokenizer
- educational
- code
- mathematics
viewer: false  # Disable dataset viewer for packed sequences (not human-readable)
---

# Simple-100M Pretraining Dataset (1B Tokens)

[![License](https://img.shields.io/badge/license-CC--BY--4.0%20%26%20varies-green)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)
[![Tokens](https://img.shields.io/badge/tokens-1B-blue)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)
[![Format](https://img.shields.io/badge/format-Arrow%20%28uint16%29-orange)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)

A **training-optimized, packed pretraining dataset** for ~100M parameter language models. Built for reproducibility, minimal runtime overhead, and exact mixing ratios.

---

## 🎯 Purpose

This dataset was created to train **Simple-100M**, a decoder-only Transformer targeting:

- ✅ Beat GPT-2-70M perplexity with minimal complexity
- ✅ Reproducible artifacts with exact token accounting
- ✅ Zero runtime preprocessing (ready-to-train)

**Target Architecture**: 32 layers, 448 hidden, 7 heads, SwiGLU, RoPE, RMSNorm, tied embeddings (~97.8M params).
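
The stated size of the target architecture can be sanity-checked with a rough parameter count. The sketch below is not taken from the dataset card. It assumes bias-free linear layers, four `d_model × d_model` attention projections, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer plus a final one, no parameters for RoPE, and a hypothetical MLP width `d_ff=1152` (the card does not state the MLP width). The function name `estimate_params` is also illustrative.

```python
def estimate_params(vocab=50257, d_model=448, n_layers=32, d_ff=1152):
    """Rough parameter count for a decoder-only Transformer.

    Assumptions (not stated in the dataset card): no biases, tied
    input/output embeddings counted once, d_ff=1152 is a guess.
    """
    embed = vocab * d_model            # token embedding, shared with the LM head
    attn = 4 * d_model * d_model       # Q, K, V and output projections
    mlp = 3 * d_model * d_ff           # SwiGLU: gate, up and down projections
    norms = 2 * d_model                # two RMSNorm weight vectors per layer
    final_norm = d_model               # RMSNorm before the LM head
    return embed + n_layers * (attn + mlp + norms) + final_norm


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes out close to the ~97.8M figure quoted above; a different MLP width or the presence of biases would shift it.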

---

## 📊 Dataset Composition

### Token Allocation (Exact Mixing Ratios)

| Source | Tokens | Ratio | Description |
|--------|--------|-------|-------------|
| Cosmopedia (`web_samples_v1`) | 300M | 30% | Educational content, tutorials, explanations |
| FineWeb-Edu (`score≥3`) | 300M | 30% | High-quality educational web text |
| Finewiki (`en`) | 200M | 20% | Clean English Wikipedia articles |
| OpenWebMath | 100M | 10% | Mathematical content, LaTeX, reasoning |
| Python Code (`smollm-corpus:python-edu`) | 80M | 8% | Deduplicated, high-quality Python code |
| TinyStories | 20M | 2% | Synthetic short stories for coherence |
| **Total** | **1,000M** | **100%** | |

### Train/Validation Split

- **Training**: 966,797 sequences × 1,024 tokens = **990,000,128 tokens** (99%)
- **Validation**: 9,765 sequences × 1,024 tokens = **9,999,360 tokens** (1%)
- Split strategy: Stratified holdout extracted **before** shuffling to prevent leakage

---

## 🗂️ Format & Schema

### File Format

- Apache Arrow (`.arrow`) with chunked storage for efficient streaming
- Native `uint16` dtype for token IDs (GPT-2 vocab: 0–50,256)

### Schema

```python
from datasets import Features, Sequence, Value

features = Features({
    "input_ids": Sequence(Value("uint16"), length=1024),    # Token IDs
    "position_ids": Sequence(Value("uint16"), length=1024)  # Reset at sequence start
})
```
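
To make the two schema columns concrete, here is a minimal packing sketch. It is not the author's build script. It assumes that "reset at sequence start" means `position_ids` restart at 0 whenever a new source document begins inside a packed row, and also at the start of every row; the card does not spell this out. The function name `pack_documents`, the toy token IDs, and the shortened row length of 8 (the dataset uses 1,024) are illustrative.

```python
import numpy as np

GPT2_MAX_TOKEN_ID = 50256  # largest GPT-2 token ID, as stated in the card
assert GPT2_MAX_TOKEN_ID <= np.iinfo(np.uint16).max  # every ID fits in uint16


def pack_documents(docs, seq_len):
    """Concatenate tokenized documents and cut them into fixed-length rows.

    Assumption: positions restart at 0 at each document start and at each
    row start. The incomplete final row is dropped.
    """
    input_ids, position_ids = [], []
    for doc in docs:
        offset = 0                            # new document: restart positions
        for token in doc:
            if len(input_ids) % seq_len == 0:
                offset = 0                    # new row: restart positions
            input_ids.append(token)
            position_ids.append(offset)
            offset += 1
    n_rows = len(input_ids) // seq_len
    cut = n_rows * seq_len
    ids = np.asarray(input_ids[:cut], dtype=np.uint16).reshape(n_rows, seq_len)
    pos = np.asarray(position_ids[:cut], dtype=np.uint16).reshape(n_rows, seq_len)
    return ids, pos


# Three toy "documents" of 3, 6 and 9 tokens, packed into rows of 8.
docs = [[11, 12, 13],
        [21, 22, 23, 24, 25, 26],
        [31, 32, 33, 34, 35, 36, 37, 38, 39]]
ids, pos = pack_documents(docs, seq_len=8)
print(ids)
print(pos)
```

When feeding such rows to a framework that expects 64-bit indices for embedding lookups, the `uint16` arrays would typically be cast (for example with `ids.astype(np.int64)`) at load time.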
Provider:
Dodosoomro