
thepowerfuldeez/1218_imu1_base_stable_corpus

Source: Hugging Face. Updated 2026-02-04. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- pretraining
- language-model
- imu-1
- tokenized
size_categories:
- 10B<n<100B
arxiv: 2602.02522
---

# IMU-1 Stage 1 Training Corpus (Stable Phase)

Pre-tokenized training data for Stage 1 (stable phase) of [IMU-1](https://huggingface.co/thepowerfuldeez/imu1_base), a sample-efficient 430M parameter language model.

## Dataset Details

| Property | Value |
|----------|-------|
| Tokens | ~29B |
| Format | Memory-mapped NumPy (`.npy`) |
| Tokenizer | [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
| Vocab size | 49,152 |

## Data Sources

High-quality filtered web data including:

- DCLM-edu (educational content filtered from DCLM)
- FineWeb-edu
- Curated web sources

## Download

```bash
huggingface-cli download thepowerfuldeez/1218_imu1_base_stable_corpus --repo-type=dataset
```

## Usage with sample_efficient_gpt

```bash
# Clone training framework
git clone https://github.com/thepowerfuldeez/sample_efficient_gpt
cd sample_efficient_gpt

# Install dependencies
export UV_TORCH_BACKEND=auto
uv pip install setuptools uv_build maturin
uv sync

# Train Stage 1
uv run torchrun --nproc_per_node 8 train.py \
  --config configs/imu1_base.yaml \
  --config-key stable
```

## Training Configuration (Stage 1)

| Parameter | Value |
|-----------|-------|
| Schedule | WSD (stable phase) |
| Iterations | 100,000 |
| Batch size | 384 |
| Context length | 768 |
| Muon LR | 1.1e-2 |
| Warmup | 2,500 steps |

## Related Resources

- **Model:** [thepowerfuldeez/imu1_base](https://huggingface.co/thepowerfuldeez/imu1_base)
- **Stage 2 Data:** [thepowerfuldeez/1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)
- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
  title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
  author={George Grigorev},
  year={2026},
  eprint={2602.02522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02522},
}
```
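
The Stage 1 training configuration gives enough to sanity-check the ~29B token figure listed under Dataset Details. The following is a minimal sketch, not part of the original card. It assumes that "Batch size" counts sequences per optimizer step and that every sequence is filled to the full context length. Neither assumption is stated in the card, and the function names are illustrative only.

```python
# Values copied from the Stage 1 training configuration table.
ITERATIONS = 100_000
BATCH_SIZE = 384        # ASSUMPTION: sequences per optimizer step
CONTEXT_LENGTH = 768    # tokens per sequence
WARMUP_STEPS = 2_500


def tokens_per_step(batch_size: int, context_length: int) -> int:
    """Tokens consumed by one optimizer step, assuming fully packed sequences."""
    return batch_size * context_length


def total_tokens(iterations: int, batch_size: int, context_length: int) -> int:
    """Tokens consumed over the whole stage under the same assumption."""
    return iterations * tokens_per_step(batch_size, context_length)


if __name__ == "__main__":
    per_step = tokens_per_step(BATCH_SIZE, CONTEXT_LENGTH)
    total = total_tokens(ITERATIONS, BATCH_SIZE, CONTEXT_LENGTH)
    print(f"tokens per step: {per_step:,}")
    print(f"total tokens:    {total:,} (~{total / 1e9:.1f}B)")
    print(f"warmup share:    {WARMUP_STEPS / ITERATIONS:.1%}")
```

Under these assumptions the stage consumes about 29.5B tokens, which is consistent with the ~29B listed for the corpus and suggests roughly one pass over the data. The 2,500 warmup steps would account for 2.5% of the iterations.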
Provided by:
thepowerfuldeez