thepowerfuldeez/1218_imu1_base_stable_corpus

Name: thepowerfuldeez/1218_imu1_base_stable_corpus
Creator: thepowerfuldeez
Published: 2026-02-04 21:31:27
License: apache-2.0

Source: Hugging Face. Updated 2026-02-04; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus


Description:

---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- language-model
- imu-1
- tokenized
size_categories:
- 10B<n<100B
arxiv: 2602.02522
---

# IMU-1 Stage 1 Training Corpus (Stable Phase)

Pre-tokenized training data for Stage 1 (stable phase) of [IMU-1](https://huggingface.co/thepowerfuldeez/imu1_base), a sample-efficient 430M parameter language model.

## Dataset Details

| Property | Value |
|----------|-------|
| Tokens | ~29B |
| Format | Memory-mapped NumPy (`.npy`) |
| Tokenizer | [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
| Vocab size | 49,152 |

## Data Sources

High-quality filtered web data including:

- DCLM-edu (educational content filtered from DCLM)
- FineWeb-edu
- Curated web sources

## Download

```bash
huggingface-cli download thepowerfuldeez/1218_imu1_base_stable_corpus --repo-type=dataset
```

## Usage with sample_efficient_gpt

```bash
# Clone training framework
git clone https://github.com/thepowerfuldeez/sample_efficient_gpt
cd sample_efficient_gpt

# Install dependencies
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync

# Train Stage 1
uv run torchrun --nproc_per_node 8 train.py \
    --config configs/imu1_base.yaml \
    --config-key stable
```

## Training Configuration (Stage 1)

| Parameter | Value |
|-----------|-------|
| Schedule | WSD (stable phase) |
| Iterations | 100,000 |
| Batch size | 384 |
| Context length | 768 |
| Muon LR | 1.1e-2 |
| Warmup | 2,500 steps |

## Related Resources

- **Model:** [thepowerfuldeez/imu1_base](https://huggingface.co/thepowerfuldeez/imu1_base)
- **Stage 2 Data:** [thepowerfuldeez/1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)
- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
  title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
  author={George Grigorev},
  year={2026},
  eprint={2602.02522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02522},
}
```
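
The dataset card describes the corpus as memory-mapped NumPy (`.npy`) files and lists the Stage 1 iteration count, batch size, and context length. The Python sketch below shows one way such a file might be read and how those three numbers relate to the stated token count. It is not the loader used by `sample_efficient_gpt`. It assumes each shard is a flat 1-D array of token IDs, that the batch size counts sequences per step, and that batches are drawn as random windows; the function names are hypothetical.

```python
import numpy as np

# Values copied from the Stage 1 training configuration table.
ITERATIONS = 100_000
BATCH_SIZE = 384       # assumption: number of sequences per optimizer step
CONTEXT_LENGTH = 768   # tokens per sequence


def stage1_token_budget(iterations=ITERATIONS, batch_size=BATCH_SIZE,
                        context_length=CONTEXT_LENGTH):
    """Tokens seen if every step consumes batch_size full-length sequences."""
    return iterations * batch_size * context_length


def open_token_shard(path):
    """Memory-map a .npy file so tokens are paged in on demand, not loaded into RAM."""
    return np.load(path, mmap_mode="r")


def get_batch(tokens, batch_size, context_length, rng):
    """Sample random windows from a 1-D token array (assumed layout).

    Returns (inputs, targets), where targets are inputs shifted by one token.
    """
    # Highest valid start leaves room for context_length + 1 tokens.
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s:s + context_length] for s in starts])
    y = np.stack([tokens[s + 1:s + context_length + 1] for s in starts])
    return x.astype(np.int64), y.astype(np.int64)


if __name__ == "__main__":
    # 100,000 * 384 * 768 = 29,491,200,000, in line with the "~29B" in the card.
    print(f"{stage1_token_budget():,}")
```

Under these assumptions the three configuration values multiply out to about 29.5B tokens, which matches the "~29B" figure in the Dataset Details table. To try `get_batch` on real data, pass it the array returned by `open_token_shard` for one downloaded shard together with `np.random.default_rng()`.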

Provider:

thepowerfuldeez
