chengjunyan1/smollm-12.5-corpus

Name: chengjunyan1/smollm-12.5-corpus
Creator: chengjunyan1
Published: 2024-09-23 20:44:32
License: No description available

Hugging Face | Updated 2024-09-23 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/chengjunyan1/smollm-12.5-corpus

Resource overview:

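The SmolLM-1/8-Corpus dataset is a high-quality subset of the SmolLM Corpus, intended for training Chinchilla-optimal GPT-2 scale models. It is drawn mainly from FineWeb-edu-dedup, which makes up about 70% of the corpus; the other datasets, such as Python-Edu, OpenWebMath, StackOverFlow, and DeepMindMath-small, are sampled at specific ratios. All sampling uses random seed 42 to ensure reproducibility. The dataset statistics include the token count of each component dataset and its distribution across the training, test, and eval sets.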

The SmolLM-1/8-Corpus dataset is a high-quality subset of the SmolLM Corpus, designed for training Chinchilla-optimal GPT-2 scale (sub-1.5B) models. The dataset first filters samples with int_score >= 4 from FineWeb-edu-dedup, then maintains the same training mixture distribution as SmolLM. FineWeb-Edu-dedup makes up around 70% of the corpus. Other datasets are sampled according to their respective mixture ratios. For Python-Edu, the score cutoff is set to 3.65 to control the ratio. Other datasets are sampled randomly. All random seeds are 42. Following the Pile method, 1GB of data is randomly sampled from the original SmolLM Corpus for each of the test and eval sets, then any verbatim duplicates of that held-out data are removed from the training set. The dataset includes FineWeb-Edu-dedup, Cosmopedia-v2, Python-Edu, OpenWebMath, StackOverFlow, and DeepMindMath-small, each with specific token counts in the training, test, and eval sets.
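
The steps described above (score filtering, seeded sampling, and removing held-out duplicates from the training set) can be illustrated with a minimal sketch. This is not the dataset author's actual pipeline. The toy corpus, the function names, the `text` field, and sampling held-out documents by count rather than by 1GB of data are all assumptions made for illustration; only the `int_score >= 4` threshold and the seed 42 come from the description.

```python
import random

SEED = 42  # the description states that all random seeds are 42


def filter_by_score(docs, min_score=4):
    """Keep documents whose int_score is at least min_score."""
    return [d for d in docs if d["int_score"] >= min_score]


def sample_fraction(docs, fraction, seed=SEED):
    """Draw a reproducible random subset holding about `fraction` of docs."""
    rng = random.Random(seed)
    return rng.sample(docs, round(len(docs) * fraction))


def remove_verbatim_duplicates(train, held_out):
    """Drop training documents whose text appears verbatim in held_out."""
    held_out_texts = {d["text"] for d in held_out}
    return [d for d in train if d["text"] not in held_out_texts]


# Hypothetical toy corpus: 60 documents with int_score cycling through 0..5.
corpus = [{"text": f"doc {i}", "int_score": i % 6} for i in range(60)]

# Held-out data is drawn from the ORIGINAL corpus, before any filtering.
# Assumption: sample by document count here instead of by data size.
rng = random.Random(SEED)
held_out = rng.sample(corpus, 12)
test_set, eval_set = held_out[:6], held_out[6:]

# Training data: filter by score first, then remove held-out documents.
train = remove_verbatim_duplicates(filter_by_score(corpus), held_out)

print(len(train), len(test_set), len(eval_set))
```

Because every random draw goes through a `random.Random` instance built from a fixed seed, rerunning the script selects the same documents, which is the reproducibility property the description attributes to seed 42.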

Provided by:

chengjunyan1
