thepowerfuldeez/1226_imu1_base_decay_corpus
Hugging Face | Updated 2026-02-04 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- language-model
- imu-1
- tokenized
size_categories:
- 10B<n<100B
arxiv: 2602.02522
---
# IMU-1 Stage 2 Training Corpus (Decay Phase)
Pre-tokenized training data for Stage 2 (decay phase) of [IMU-1](https://huggingface.co/thepowerfuldeez/imu1_base), a sample-efficient 430M-parameter language model.
## Dataset Details
| Property | Value |
|----------|-------|
| Tokens | ~28B |
| Format | Memory-mapped NumPy (`.npy`) |
| Tokenizer | [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
| Vocab size | 49,152 |
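
Because the shards are memory-mapped NumPy arrays, they can be opened without reading the whole file into RAM. The following is a minimal sketch, not the loader used by `sample_efficient_gpt`. It assumes each `.npy` file holds a flat 1-D array of integer token IDs; the shard path and the helper names `open_shard` and `get_window` are hypothetical. A vocabulary of 49,152 fits in `uint16`, but the actual on-disk dtype is not stated in this card, so the sketch only checks that it is an integer type.

```python
import numpy as np

VOCAB_SIZE = 49_152   # from the table above
CONTEXT_LEN = 896     # Stage 2 context length listed in this card


def open_shard(path):
    """Memory-map one token shard (assumed: flat 1-D array of token IDs)."""
    tokens = np.load(path, mmap_mode="r")  # no full read into RAM
    if tokens.ndim != 1:
        raise ValueError(f"expected a flat token array, got shape {tokens.shape}")
    if not np.issubdtype(tokens.dtype, np.integer):
        raise ValueError(f"expected integer token IDs, got dtype {tokens.dtype}")
    return tokens


def get_window(tokens, start, context_len=CONTEXT_LEN):
    """Return (inputs, targets) for next-token prediction from one window."""
    # Only this slice is copied out of the memory map.
    chunk = np.asarray(tokens[start:start + context_len + 1], dtype=np.int64)
    if chunk.shape[0] != context_len + 1:
        raise ValueError("window runs past the end of the shard")
    return chunk[:-1], chunk[1:]
```

Usage would look like `x, y = get_window(open_shard("shard.npy"), 0)`, where `shard.npy` stands in for whichever file name the downloaded repository actually uses.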
## Data Sources
Stage 2 uses tighter quality filters than Stage 1:
- DCLM-edu (higher threshold filtering)
- FineWeb-edu
- FineMath
- Curated high-quality sources
## Download
```bash
huggingface-cli download thepowerfuldeez/1226_imu1_base_decay_corpus --repo-type=dataset
```
## Usage with sample_efficient_gpt
```bash
# Clone training framework
git clone https://github.com/thepowerfuldeez/sample_efficient_gpt
cd sample_efficient_gpt
# Install dependencies
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync
# Train Stage 2 (requires Stage 1 checkpoint)
uv run torchrun --nproc_per_node 8 train.py \
--config configs/imu1_base.yaml \
--config-key decay
```
## Training Configuration (Stage 2)
| Parameter | Value |
|-----------|-------|
| Schedule | WSD (decay phase) |
| Iterations | 100,000 (200k total) |
| Batch size | 312 |
| Context length | 896 |
| Muon LR | 1.15e-2 → 25% min |
| Decay start | 100k steps |
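
The numbers in the table above are consistent with the ~28B-token corpus size: 312 sequences of 896 tokens give 279,552 tokens per step, or about 27.96B tokens over 100,000 iterations. The sketch below reproduces that arithmetic and illustrates a WSD-style learning-rate curve. Several points are assumptions rather than facts from this card: that the batch size counts sequences per optimizer step across all devices, that "25% min" means the learning rate ends at 25% of its peak, that the decay is linear, and that warmup can be ignored. The function names are hypothetical and this is not the schedule code from `sample_efficient_gpt`.

```python
BATCH_SIZE = 312        # sequences per step (assumed global)
CONTEXT_LEN = 896       # tokens per sequence
STAGE2_STEPS = 100_000  # Stage 2 iterations
DECAY_START = 100_000   # step at which decay begins
TOTAL_STEPS = 200_000   # total iterations
PEAK_LR = 1.15e-2       # Muon learning rate
MIN_LR_FRACTION = 0.25  # assumed: final LR as a fraction of peak


def tokens_per_step(batch_size=BATCH_SIZE, context_len=CONTEXT_LEN):
    """Tokens consumed by one optimizer step."""
    return batch_size * context_len


def wsd_lr(step, peak_lr=PEAK_LR, decay_start=DECAY_START,
           total_steps=TOTAL_STEPS, min_fraction=MIN_LR_FRACTION):
    """Stable-then-decay learning rate (warmup omitted, linear decay assumed)."""
    if step < decay_start:
        return peak_lr  # stable phase
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return peak_lr * (1.0 - (1.0 - min_fraction) * progress)


if __name__ == "__main__":
    print(tokens_per_step())                 # 279552
    print(tokens_per_step() * STAGE2_STEPS)  # 27955200000
    for s in (0, 100_000, 150_000, 200_000):
        print(s, wsd_lr(s))
```

If the framework uses a different decay shape (for example cosine or inverse square root), only the last line of `wsd_lr` would change; the token-budget arithmetic is unaffected.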
## Related Resources
- **Model:** [thepowerfuldeez/imu1_base](https://huggingface.co/thepowerfuldeez/imu1_base)
- **Stage 1 Data:** [thepowerfuldeez/1218_imu1_base_stable_corpus](https://huggingface.co/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus)
- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)
## Citation
```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
author={George Grigorev},
year={2026},
eprint={2602.02522},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02522},
}
```
Provider:
thepowerfuldeez



