thepowerfuldeez/1218_imu1_base_stable_corpus
Source: Hugging Face · Updated 2026-02-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- language-model
- imu-1
- tokenized
size_categories:
- 10B<n<100B
arxiv: 2602.02522
---
# IMU-1 Stage 1 Training Corpus (Stable Phase)
Pre-tokenized training data for Stage 1 (the stable phase) of [IMU-1](https://huggingface.co/thepowerfuldeez/imu1_base), a sample-efficient, 430M-parameter language model.
## Dataset Details
| Property | Value |
|----------|-------|
| Tokens | ~29B |
| Format | Memory-mapped NumPy (`.npy`) |
| Tokenizer | [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
| Vocab size | 49,152 |
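
Because the data ships as memory-mapped `.npy` arrays, a shard can be opened without reading it into RAM. The following is a minimal sketch, not the loader used by the training framework. It assumes each shard is a flat 1-D array of token IDs; the helper names `open_token_shard` and `sample_window` are hypothetical, and the on-disk dtype is not stated in this card (a 49,152-entry vocabulary would fit in `uint16`, but that is only a guess).

```python
import numpy as np

VOCAB_SIZE = 49_152  # from the table above


def open_token_shard(path):
    """Open a .npy token shard lazily (hypothetical helper, not part of the repo)."""
    # mmap_mode="r" returns a read-only np.memmap backed by the file on disk,
    # so only the slices that are actually indexed get read.
    tokens = np.load(path, mmap_mode="r")
    # Assumption: shards are flat 1-D arrays of token IDs.
    if tokens.ndim != 1:
        raise ValueError(f"expected a flat token array, got shape {tokens.shape}")
    return tokens


def sample_window(tokens, context_length, rng):
    """Return (inputs, targets) for one random window; targets are shifted by one."""
    if len(tokens) <= context_length:
        raise ValueError("shard is shorter than context_length + 1 tokens")
    # High bound is exclusive, so start + context_length + 1 never exceeds len(tokens).
    start = int(rng.integers(0, len(tokens) - context_length))
    # Copy the slice out of the memmap and widen it to int64 for downstream use.
    window = np.asarray(tokens[start : start + context_length + 1], dtype=np.int64)
    return window[:-1], window[1:]
```

Calling `sample_window(tokens, 768, np.random.default_rng(0))` on an opened shard would yield two arrays of length 768, matching the Stage 1 context length listed further down.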
## Data Sources
High-quality filtered web data, including:
- DCLM-edu (educational content filtered from DCLM)
- FineWeb-edu
- Curated web sources
## Download
```bash
huggingface-cli download thepowerfuldeez/1218_imu1_base_stable_corpus --repo-type=dataset
```
## Usage with sample_efficient_gpt
```bash
# Clone training framework
git clone https://github.com/thepowerfuldeez/sample_efficient_gpt
cd sample_efficient_gpt
# Install dependencies
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync
# Train Stage 1
uv run torchrun --nproc_per_node 8 train.py \
  --config configs/imu1_base.yaml \
  --config-key stable
```
## Training Configuration (Stage 1)
| Parameter | Value |
|-----------|-------|
| Schedule | WSD (stable phase) |
| Iterations | 100,000 |
| Batch size | 384 |
| Context length | 768 |
| Muon LR | 1.1e-2 |
| Warmup | 2,500 steps |
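
As a sanity check, the numbers in this table can be related to the ~29B token count reported under Dataset Details. The sketch below assumes the batch size counts sequences per optimizer step and that every sequence is exactly one context length long. It also illustrates the warmup and stable portions of a WSD schedule, assuming a linear warmup from near zero; the actual warmup shape used by the training code is not stated here, and the function names are hypothetical.

```python
ITERATIONS = 100_000
BATCH_SIZE = 384         # assumption: sequences per optimizer step
CONTEXT_LENGTH = 768     # tokens per sequence
PEAK_MUON_LR = 1.1e-2
WARMUP_STEPS = 2_500


def tokens_seen(iterations=ITERATIONS, batch_size=BATCH_SIZE,
                context_length=CONTEXT_LENGTH):
    """Total tokens processed if every step consumes a full batch."""
    return iterations * batch_size * context_length


def stable_phase_lr(step, peak_lr=PEAK_MUON_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed step: linear warmup, then constant.

    Assumption: the warmup is linear; the decay part of WSD belongs to
    Stage 2 and is not modeled here.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


if __name__ == "__main__":
    print(f"{tokens_seen():,}")  # 29,491,200,000
```

Under these assumptions the run processes about 29.5B tokens, which is consistent with the ~29B figure above.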
## Related Resources
- **Model:** [thepowerfuldeez/imu1_base](https://huggingface.co/thepowerfuldeez/imu1_base)
- **Stage 2 Data:** [thepowerfuldeez/1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)
- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)
## Citation
```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
author={George Grigorev},
year={2026},
eprint={2602.02522},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02522},
}
```
Provider:
thepowerfuldeez



