EleutherAI/SmolLM-135M-100b

Name: EleutherAI/SmolLM-135M-100b
Creator: EleutherAI
Published: 2025-03-18 11:50:17
License: No description available

Source: Hugging Face. Updated: 2025-03-18. Indexed: 2025-04-08.

Download link:

https://hf-mirror.com/datasets/EleutherAI/SmolLM-135M-100b

Resource description:

This is a text dataset of approximately 100 billion tokens, sampled from the SmolLM corpus mixture used to train the SmolLM-135M model. The dataset has two features, the text content and its source, both stored as strings. The training set is 425,062,797,780 bytes (roughly 425 GB) and contains about 108,955,432 examples.
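As a rough illustration of the two-feature layout described above, the sketch below tallies examples per source and derives the average example size from the figures on this page. The column names `text` and `source` are assumptions (the page only says there is a text feature and a source feature), and the helper names are hypothetical. Streaming through the Hugging Face `datasets` library is one possible way to read the data without downloading all of it, not a method this page prescribes.

```python
from collections import Counter
from itertools import islice

REPO_ID = "EleutherAI/SmolLM-135M-100b"  # dataset name as listed on this page
TEXT_KEY = "text"      # assumed column name for the text content
SOURCE_KEY = "source"  # assumed column name for the source information

TOTAL_BYTES = 425_062_797_780   # training set size stated on this page
TOTAL_EXAMPLES = 108_955_432    # example count stated on this page


def average_bytes_per_example(total_bytes=TOTAL_BYTES, total_examples=TOTAL_EXAMPLES):
    """Return the mean stored size of one example, in bytes."""
    return total_bytes / total_examples


def count_sources(examples, limit=1000):
    """Tally the source field over the first `limit` examples of an iterable."""
    counts = Counter()
    for example in islice(examples, limit):
        counts[example[SOURCE_KEY]] += 1
    return counts


def stream_train_split():
    """Open the train split lazily (needs the `datasets` package and network access)."""
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split="train", streaming=True)


def main():
    """Hypothetical entry point: print the average size and a small source tally."""
    print(f"{average_bytes_per_example():.0f} bytes per example on average")
    print(count_sources(stream_train_split(), limit=1000))
```

Calling `main()` requires the `datasets` package and a network connection. `count_sources` accepts any iterable of dictionaries, so it can be tried on a few hand-written records first.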

Provider:

EleutherAI
