EleutherAI/SmolLM-135M-100b
Source: Hugging Face. Updated: 2025-03-18. Indexed: 2025-04-08.
Download link:
https://hf-mirror.com/datasets/EleutherAI/SmolLM-135M-100b
Description:
This text dataset contains approximately 100 billion tokens, sampled from the SmolLM corpus mixture used to train the SmolLM-135M model. It has two features, both stored as strings: the text content and the source of each text. The training set is 425,062,797,780 bytes in size and contains 108,955,432 examples.
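The figures in the description are enough for a rough sense of how long a typical example is. The sketch below is plain arithmetic on the stated numbers, not anything published by the dataset authors. It assumes the token count is exactly 100 billion (the description only says "approximately") and that the byte count is the size of the stored text, so the results are ballpark values only.

```python
# Figures taken from the dataset description.
TRAIN_BYTES = 425_062_797_780
NUM_EXAMPLES = 108_955_432

# Assumption: the description says "approximately 100 billion tokens",
# so this value is a round-number estimate, not an exact count.
APPROX_TOKENS = 100_000_000_000

# Average size of one example in bytes (roughly 3.9 kB).
bytes_per_example = TRAIN_BYTES / NUM_EXAMPLES

# Average length of one example in tokens (roughly 900 tokens).
tokens_per_example = APPROX_TOKENS / NUM_EXAMPLES

# Average number of bytes covered by one token (a little over 4).
bytes_per_token = TRAIN_BYTES / APPROX_TOKENS

print(f"bytes per example:  {bytes_per_example:,.1f}")
print(f"tokens per example: {tokens_per_example:,.1f}")
print(f"bytes per token:    {bytes_per_token:.2f}")
```

These averages say nothing about the distribution: individual examples may be far shorter or longer, and the ratio may differ between sources in the mixture.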
Provider:
EleutherAI



