vukrosic/blueberry-1B-pretrain
Hugging Face | Updated 2025-12-17 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/vukrosic/blueberry-1B-pretrain
Description:
This is the pre-tokenized, packed, and shuffled dataset used to train the **Blueberry-Nano** model (151M params). It contains approximately **1 billion tokens**.

Dataset details:

- **Total Tokens**: ~1,000,000,000
- **Sequence Length**: 2048
- **Tokenizer**: `HuggingFaceTB/SmolLM2-135M`
- **Format**: Packed sequences (input_ids + labels), saved as Arrow/Parquet

The dataset consists of a globally shuffled mix of 70% high-quality educational web content (FineWeb-Edu) and 30% synthetic textbook and encyclopedic content (Cosmopedia-v2).
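To make the "packed sequences (input_ids + labels)" format concrete, here is a minimal sketch of how such rows could be produced and what sizes the stated figures imply. The card does not describe the actual packing procedure, so the details here are assumptions: the function names `pack_sequences` and `token_budget` are hypothetical, documents are assumed to be joined with an EOS token (the `eos_id` value is a placeholder), the trailing partial block is assumed to be dropped, and `labels` is assumed to be a plain copy of `input_ids`.

```python
from typing import Dict, Iterable, List

SEQ_LEN = 2048  # sequence length stated in the dataset card


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int = SEQ_LEN, eos_id: int = 0
) -> List[Dict[str, List[int]]]:
    """Concatenate tokenized documents and cut them into fixed-length rows.

    Assumptions (not confirmed by the card): documents are separated by an
    EOS token, the trailing partial block is discarded, and labels are a
    copy of input_ids.
    """
    buffer: List[int] = []
    rows: List[Dict[str, List[int]]] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # placeholder separator id, an assumption
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_len:
            block, buffer = buffer[:seq_len], buffer[seq_len:]
            rows.append({"input_ids": block, "labels": list(block)})
    return rows


def token_budget(
    total_tokens: int = 1_000_000_000, seq_len: int = SEQ_LEN
) -> Dict[str, int]:
    """Rough sizes implied by the card's approximate figures."""
    return {
        "sequences": total_tokens // seq_len,
        "fineweb_edu_tokens": total_tokens * 70 // 100,
        "cosmopedia_v2_tokens": total_tokens * 30 // 100,
    }


if __name__ == "__main__":
    toy_rows = pack_sequences([[1, 2, 3], [4, 5]], seq_len=4)
    print(toy_rows)
    print(token_budget())
```

Under these assumptions, roughly 1 billion tokens at 2048 tokens per row works out to about 488 thousand rows, with about 700 million tokens from FineWeb-Edu and 300 million from Cosmopedia-v2. The published files themselves would presumably be read with the `datasets` library, for example `load_dataset("vukrosic/blueberry-1B-pretrain")`, though the available splits and configurations are not stated here.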
Provider:
vukrosic



