Dodosoomro/simple-100m-pretrain-1b

Name: Dodosoomro/simple-100m-pretrain-1b
Creator: Dodosoomro
Published: 2026-04-13 22:46:50
License: No description provided

Source: Hugging Face | Updated: 2026-04-13 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/Dodosoomro/simple-100m-pretrain-1b

Resource overview:

---
language:
- en
license: other
task_categories:
- text-generation
pretty_name: Simple-100M Pretraining Dataset (1B Tokens)
size_categories:
- 1B<n<10B
tags:
- text
- pretraining
- small-llm
- packed-dataset
- uint16-format
- position-ids
- gpt2-tokenizer
- educational
- code
- mathematics
viewer: false  # Disable dataset viewer for packed sequences (not human-readable)
---

# Simple-100M Pretraining Dataset (1B Tokens)

[![License](https://img.shields.io/badge/license-CC--BY--4.0%20%26%20varies-green)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)
[![Tokens](https://img.shields.io/badge/tokens-1B-blue)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)
[![Format](https://img.shields.io/badge/format-Arrow%20%28uint16%29-orange)](https://huggingface.co/datasets/Dodosoomro/simple-100m-pretrain-1b)

A **training-optimized, packed pretraining dataset** for ~100M parameter language models. Built for reproducibility, minimal runtime overhead, and exact mixing ratios.

---

## 🎯 Purpose

This dataset was created to train **Simple-100M**, a decoder-only Transformer targeting:

- ✅ Beat GPT-2-70M perplexity with minimal complexity
- ✅ Reproducible artifacts with exact token accounting
- ✅ Zero runtime preprocessing (ready-to-train)

**Target Architecture**: 32 layers, 448 hidden, 7 heads, SwiGLU, RoPE, RMSNorm, tied embeddings (~97.8M params).
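
The stated shape can be sanity-checked against the ~97.8M figure with a rough parameter count. This is only a sketch under several assumptions that are not in the card: the class name `Simple100MConfig` is hypothetical, linear layers are taken to be bias-free, the vocabulary is taken as the unpadded GPT-2 size of 50,257, and the SwiGLU hidden width `d_ff = 1152` is a guess chosen solely because it lands near the stated total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Simple100MConfig:  # hypothetical name, not from the dataset card
    n_layers: int = 32        # from the card
    d_model: int = 448        # from the card ("448 hidden")
    n_heads: int = 7          # from the card
    vocab_size: int = 50_257  # assumption: unpadded GPT-2 vocabulary
    d_ff: int = 1_152         # assumption: SwiGLU width is NOT stated in the card

    @property
    def head_dim(self) -> int:
        # Per-head width; d_model must divide evenly across heads.
        return self.d_model // self.n_heads

    def param_count(self) -> int:
        embed = self.vocab_size * self.d_model  # tied with the output head, counted once
        attn = 4 * self.d_model * self.d_model  # Q, K, V, O projections, no biases
        mlp = 3 * self.d_model * self.d_ff      # SwiGLU: gate, up, and down projections
        norms = 2 * self.d_model                # two RMSNorm gain vectors per layer
        final_norm = self.d_model               # one RMSNorm before the output head
        return embed + self.n_layers * (attn + mlp + norms) + final_norm


cfg = Simple100MConfig()
print(cfg.head_dim)       # 64
print(cfg.param_count())  # 97779584
```

Under these assumptions the count comes to about 97.78M, consistent with the "~97.8M params" figure; RoPE contributes no learned parameters. A different `d_ff` or a padded vocabulary would shift the total.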

---

## 📊 Dataset Composition

### Token Allocation (Exact Mixing Ratios)

| Source | Tokens | Ratio | Description |
|--------|--------|-------|-------------|
| Cosmopedia (`web_samples_v1`) | 300M | 30% | Educational content, tutorials, explanations |
| FineWeb-Edu (`score≥3`) | 300M | 30% | High-quality educational web text |
| Finewiki (`en`) | 200M | 20% | Clean English Wikipedia articles |
| OpenWebMath | 100M | 10% | Mathematical content, LaTeX, reasoning |
| Python Code (`smollm-corpus:python-edu`) | 80M | 8% | Deduplicated, high-quality Python code |
| TinyStories | 20M | 2% | Synthetic short stories for coherence |
| **Total** | **1,000M** | **100%** | |

### Train/Validation Split

- **Training**: 966,797 sequences × 1,024 tokens = **990,000,128 tokens** (99%)
- **Validation**: 9,765 sequences × 1,024 tokens = **9,999,360 tokens** (1%)
- Split strategy: Stratified holdout extracted **before** shuffling to prevent leakage

---

## 🗂️ Format & Schema

### File Format

- Apache Arrow (`.arrow`) with chunked storage for efficient streaming
- Native `uint16` dtype for token IDs (GPT-2 vocab: 0–50,256)

### Schema

```python
from datasets import Features, Sequence, Value

features = Features({
    "input_ids": Sequence(Value("uint16"), length=1024),     # Token IDs
    "position_ids": Sequence(Value("uint16"), length=1024),  # Reset at sequence start
})
```
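
The allocation table, the split sizes, and the `uint16` storage choice can all be checked with plain arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative and nothing here reads the actual Arrow files.

```python
SEQ_LEN = 1024  # tokens per packed sequence, from the schema

# Token budget per source, copied from the allocation table.
allocation = {
    "Cosmopedia": 300_000_000,
    "FineWeb-Edu": 300_000_000,
    "Finewiki": 200_000_000,
    "OpenWebMath": 100_000_000,
    "Python Code": 80_000_000,
    "TinyStories": 20_000_000,
}
total_budget = sum(allocation.values())
ratios = {name: tokens / total_budget for name, tokens in allocation.items()}

# Split sizes: number of sequences times tokens per sequence.
train_tokens = 966_797 * SEQ_LEN
val_tokens = 9_765 * SEQ_LEN
packed_total = train_tokens + val_tokens
shortfall = total_budget - packed_total  # nominal 1B budget minus packed total

# uint16 holds 0..65,535, so GPT-2 IDs (max 50,256) and positions (max 1,023) both fit.
UINT16_MAX = 2**16 - 1
GPT2_MAX_ID = 50_256
fits_uint16 = GPT2_MAX_ID <= UINT16_MAX and SEQ_LEN - 1 <= UINT16_MAX

print(f"budget={total_budget:,} packed={packed_total:,} shortfall={shortfall:,}")
print(f"train={train_tokens:,} ({train_tokens / packed_total:.2%})")
print(f"val={val_tokens:,} ({val_tokens / packed_total:.2%})")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0%}")
```

The two splits sum to 999,999,488 tokens, 512 short of the nominal 1B budget, so the 99%/1% figures are rounded. When feeding these IDs to an embedding layer, most frameworks expect a wider integer type, so a cast from `uint16` at load time is usually needed.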

Provider:

Dodosoomro
