SmolLM2-135M-10B
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/EleutherAI/SmolLM2-135M-10B
Description:
This dataset is a sample of the SmolLM2 corpus described in https://arxiv.org/abs/2502.02737. Specifically, we sampled from
the SmolLM2-135M pretraining data, a 2T-token mixture made up of four complete high-quality datasets plus selected portions of
DCLM-Edu and FineWeb-Edu, with those two combined at a 6:4 ratio.
This sample is intended to enable fast downloading and training of [sparsify](https://github.com/EleutherAI/sparsify) models.
The source mixture breaks down as follows:
- FineMath: 34B tokens
- Stack-Edu: 125B tokens
- InfiMM-WebMath: 40B tokens
- Cosmopedia V2: 30B tokens
- FineWeb-Edu: 710.4B tokens (1.2T in full dataset)
- DCLM-Edu: 1065.6B tokens (3.8T in full dataset)
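
As a quick sanity check on the figures above, the following sketch sums the listed token counts and confirms that the DCLM-Edu and FineWeb-Edu portions are in a 6:4 ratio. The helper `expected_tokens` is hypothetical and rests on an assumption the card does not state: that a sample of a given size is drawn proportionally from every source.

```python
# Token counts in billions, copied from the list above.
MIXTURE_B = {
    "FineMath": 34.0,
    "Stack-Edu": 125.0,
    "InfiMM-WebMath": 40.0,
    "Cosmopedia V2": 30.0,
    "FineWeb-Edu": 710.4,
    "DCLM-Edu": 1065.6,
}

# Total size of the mixture and each source's share of it.
total_b = sum(MIXTURE_B.values())
shares = {name: count / total_b for name, count in MIXTURE_B.items()}

# Fraction of the two web-scale sources that comes from DCLM-Edu.
web_b = MIXTURE_B["DCLM-Edu"] + MIXTURE_B["FineWeb-Edu"]
dclm_fraction = MIXTURE_B["DCLM-Edu"] / web_b


def expected_tokens(sample_tokens_b: float) -> dict[str, float]:
    """Hypothetical helper: per-source token counts (in billions) for a
    sample of the given size, ASSUMING strictly proportional sampling.
    The dataset card does not confirm that this is how the sample was drawn."""
    return {name: share * sample_tokens_b for name, share in shares.items()}


if __name__ == "__main__":
    print(f"total: {total_b:.1f}B tokens")
    print(f"DCLM-Edu fraction of web data: {dclm_fraction:.2f}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The listed counts add up to about 2,005B tokens, consistent with the stated 2T mixture, and DCLM-Edu accounts for 0.6 of the combined DCLM-Edu and FineWeb-Edu tokens, consistent with the 6:4 ratio.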
This sample does not include the following datasets used in the otherwise similar Stage 4 of SmolLM2-1.7B training:
- [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math): 12B tokens
- [AugGSM8K](https://github.com/OFA-Sys/gsm8k-ScRel/tree/main/data/MuggleMATH): token count unknown
Provider:
maas
Created:
2025-08-15



