vukrosic/blueberry-1B-pretrain

Name: vukrosic/blueberry-1B-pretrain
Creator: vukrosic
Published: 2025-12-17 17:18:16
License: No description available

Source: Hugging Face. Updated: 2025-12-17. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/vukrosic/blueberry-1B-pretrain

Description:

This is the pre-tokenized, packed, and shuffled dataset used to train the **Blueberry-Nano** model (151M parameters). It contains approximately **1 billion tokens**.

- **Total Tokens**: ~1,000,000,000
- **Sequence Length**: 2048
- **Tokenizer**: `HuggingFaceTB/SmolLM2-135M`
- **Format**: Packed sequences (input_ids + labels), saved as Arrow/Parquet

The dataset is a globally shuffled mix of 70% high-quality educational web content (FineWeb-Edu) and 30% synthetic textbook and encyclopedic content (Cosmopedia-v2).
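
To make the "packed sequences (input_ids + labels)" format and the stated numbers more concrete, here is a minimal sketch in plain Python. It is not the author's preprocessing script. The function names, the `eos_id` separator value, the choice to drop the trailing remainder, and the assumption that `labels` is an unshifted copy of `input_ids` are all illustrative guesses rather than facts from the dataset card.

```python
from typing import Dict, Iterable, Iterator, List

SEQ_LEN = 2048  # sequence length stated in the description


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int = SEQ_LEN, eos_id: int = 0
) -> Iterator[Dict[str, List[int]]]:
    """Concatenate tokenized documents and cut them into fixed-length rows.

    eos_id is a hypothetical separator id; the real value depends on the
    tokenizer. A trailing remainder shorter than seq_len is dropped.
    """
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            chunk, buffer = buffer[:seq_len], buffer[seq_len:]
            # Assumption: labels mirror input_ids and the trainer shifts them.
            yield {"input_ids": chunk, "labels": list(chunk)}


def token_budget(total_tokens: int, seq_len: int = SEQ_LEN) -> Dict[str, int]:
    """Rough arithmetic implied by the stated totals and the 70/30 mix."""
    return {
        "sequences": total_tokens // seq_len,
        "fineweb_edu_tokens": total_tokens * 70 // 100,
        "cosmopedia_v2_tokens": total_tokens * 30 // 100,
    }


if __name__ == "__main__":
    # Tiny demonstration with seq_len=4 instead of 2048.
    for row in pack_sequences([[1, 2, 3], [4, 5]], seq_len=4):
        print(row)
    print(token_budget(1_000_000_000))
```

Under these assumptions, about 1,000,000,000 tokens at a sequence length of 2048 come to roughly 488,000 packed rows, of which about 700 million tokens would come from FineWeb-Edu and about 300 million from Cosmopedia-v2. The actual row count may differ because the total is only approximate.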

Provided by:

vukrosic
