SmolLM2-135M-10B

Name: SmolLM2-135M-10B
Creator: maas
Published: 2025-12-05 16:46:21
License: No description provided

Source: ModelScope community (updated 2025-12-05, indexed 2025-12-06)

Download link:

https://modelscope.cn/datasets/EleutherAI/SmolLM2-135M-10B

Description:

This dataset is sampled from the SmolLM2 Corpus described in https://arxiv.org/abs/2502.02737. Specifically, we sampled from the SmolLM2-135M pretraining data, a 2T token mixture consisting of four complete high quality datasets, and selected portions of DCLM-Edu and FineWeb-Edu sampled at a 6:4 ratio. This sample is intended to enable fast downloading and training of [sparsify](https://github.com/EleutherAI/sparsify) models.

- FineMath: 34B tokens
- Stack-Edu: 125B tokens
- InfiMM-WebMath: 40B tokens
- Cosmopedia V2: 30B tokens
- FineWeb-Edu: 710.4B tokens (1.2T in full dataset)
- DCLM-Edu: 1065.6B tokens (3.8T in full dataset)

This sample does not include the following datasets used in the otherwise similar Stage 4 of SmolLM2-1.7B training:

- [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math): 12B tokens
- [AugGSM8K](https://github.com/OFA-Sys/gsm8k-ScRel/tree/main/data/MuggleMATH): ?
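The token counts listed above can be checked against the stated 2T total and the 6:4 DCLM-Edu to FineWeb-Edu ratio. The following is a minimal sketch of that arithmetic only; the names `COMPLETE_DATASETS_B`, `SAMPLED_WEB_B`, and `mixture_summary` are hypothetical and do not come from the dataset card or any library.

```python
import math

# Token counts in billions, copied from the description above.
# The four datasets included in full:
COMPLETE_DATASETS_B = {
    "FineMath": 34.0,
    "Stack-Edu": 125.0,
    "InfiMM-WebMath": 40.0,
    "Cosmopedia V2": 30.0,
}
# The two web datasets included only in part:
SAMPLED_WEB_B = {
    "FineWeb-Edu": 710.4,
    "DCLM-Edu": 1065.6,
}


def mixture_summary(complete, sampled):
    """Return totals (in billions of tokens) and the DCLM-Edu share of the web portion."""
    complete_total = sum(complete.values())
    sampled_total = sum(sampled.values())
    return {
        "complete_b": complete_total,
        "sampled_b": sampled_total,
        "total_b": complete_total + sampled_total,
        "dclm_share": sampled["DCLM-Edu"] / sampled_total,
    }


summary = mixture_summary(COMPLETE_DATASETS_B, SAMPLED_WEB_B)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under these figures the four complete datasets contribute 229B tokens and the two sampled web datasets contribute 1776B, for roughly 2005B (about 2T) in total, with DCLM-Edu making up 60% of the web portion.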

Provider:

maas

Created:

2025-08-15
