
thepowerfuldeez/1226_imu1_base_decay_corpus

Source: Hugging Face. Updated 2026-02-04. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- language-model
- imu-1
- tokenized
size_categories:
- 10B<n<100B
arxiv: 2602.02522
---

# IMU-1 Stage 2 Training Corpus (Decay Phase)

Pre-tokenized training data for Stage 2 (decay phase) of [IMU-1](https://huggingface.co/thepowerfuldeez/imu1_base), a sample-efficient 430M parameter language model.

## Dataset Details

| Property | Value |
|----------|-------|
| Tokens | ~28B |
| Format | Memory-mapped NumPy (`.npy`) |
| Tokenizer | [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
| Vocab size | 49,152 |

## Data Sources

Stage 2 uses tighter quality filters compared to Stage 1:

- DCLM-edu (higher threshold filtering)
- FineWeb-edu
- FineMath
- Curated high-quality sources

## Download

```bash
huggingface-cli download thepowerfuldeez/1226_imu1_base_decay_corpus --repo-type=dataset
```

## Usage with sample_efficient_gpt

```bash
# Clone training framework
git clone https://github.com/thepowerfuldeez/sample_efficient_gpt
cd sample_efficient_gpt

# Install dependencies
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync

# Train Stage 2 (requires Stage 1 checkpoint)
uv run torchrun --nproc_per_node 8 train.py \
    --config configs/imu1_base.yaml \
    --config-key decay
```

## Training Configuration (Stage 2)

| Parameter | Value |
|-----------|-------|
| Schedule | WSD (decay phase) |
| Iterations | 100,000 (200k total) |
| Batch size | 312 |
| Context length | 896 |
| Muon LR | 1.15e-2 → 25% min |
| Decay start | 100k steps |

## Related Resources

- **Model:** [thepowerfuldeez/imu1_base](https://huggingface.co/thepowerfuldeez/imu1_base)
- **Stage 1 Data:** [thepowerfuldeez/1218_imu1_base_stable_corpus](https://huggingface.co/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus)
- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
  title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
  author={George Grigorev},
  year={2026},
  eprint={2602.02522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02522},
}
```
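The dataset card describes the data as memory-mapped NumPy (`.npy`) files but does not document the shard layout. The following is a minimal sketch of how such a file could be read for next-token prediction, assuming each shard is a flat 1-D array of token IDs. The function names `open_tokens` and `sample_batch` are hypothetical and are not taken from `sample_efficient_gpt`; the training framework may read the data differently.

```python
import numpy as np

CONTEXT_LENGTH = 896   # from the Stage 2 training configuration
VOCAB_SIZE = 49_152    # from the dataset details


def open_tokens(path):
    """Memory-map a .npy shard so it is not read into RAM up front."""
    tokens = np.load(path, mmap_mode="r")
    if tokens.ndim != 1:
        # Assumption: shards are flat 1-D token streams.
        raise ValueError(f"expected a 1-D token array, got shape {tokens.shape}")
    return tokens


def sample_batch(tokens, batch_size, context_length=CONTEXT_LENGTH, seed=0):
    """Draw random (input, target) windows; targets are inputs shifted by one."""
    if len(tokens) <= context_length:
        raise ValueError("shard is shorter than one training window")
    rng = np.random.default_rng(seed)
    # Each window needs context_length + 1 tokens, so the last valid start
    # index is len(tokens) - context_length - 1 (the upper bound is exclusive).
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s : s + context_length] for s in starts])
    y = np.stack([tokens[s + 1 : s + 1 + context_length] for s in starts])
    # Cast to int64 because most embedding layers expect 64-bit indices.
    return x.astype(np.int64), y.astype(np.int64)
```

With a downloaded shard at a placeholder path such as `shard.npy`, `sample_batch(open_tokens("shard.npy"), batch_size=8)` would return two arrays of shape `(8, 896)`. Only the sampled windows are copied into memory; the rest of the file stays on disk.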
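The Stage 2 configuration values can be cross-checked against the stated corpus size with simple arithmetic. The sketch below assumes that "Batch size 312" means 312 sequences per optimizer step across all devices, and that "25% min" means the learning rate decays to 25% of its peak. The shape of the decay is not stated in the card, so `lr_at` uses a linear ramp purely for illustration and ignores any warmup.

```python
# Values copied from the Stage 2 training configuration.
BATCH_SIZE = 312             # assumed: sequences per optimizer step (global)
CONTEXT_LENGTH = 896         # tokens per sequence
STAGE2_ITERATIONS = 100_000
DECAY_START = 100_000        # decay begins where Stage 1 ends
TOTAL_STEPS = 200_000
PEAK_LR = 1.15e-2            # Muon learning rate
MIN_LR_FRACTION = 0.25       # assumed meaning of "25% min"

tokens_per_step = BATCH_SIZE * CONTEXT_LENGTH
stage2_tokens = tokens_per_step * STAGE2_ITERATIONS
min_lr = PEAK_LR * MIN_LR_FRACTION


def lr_at(step):
    """Hypothetical WSD schedule: constant, then LINEAR decay (shape assumed)."""
    if step <= DECAY_START:
        return PEAK_LR
    progress = min(1.0, (step - DECAY_START) / (TOTAL_STEPS - DECAY_START))
    return PEAK_LR + (min_lr - PEAK_LR) * progress


print(f"{tokens_per_step:,} tokens per step")             # 279,552 tokens per step
print(f"{stage2_tokens / 1e9:.2f}B tokens over Stage 2")  # 27.96B tokens over Stage 2
```

Under these assumptions, 100,000 iterations consume about 27.96B tokens, which agrees with the ~28B figure in the dataset details and suggests the decay phase makes roughly one pass over the corpus.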
Provider:
thepowerfuldeez