ShallowU/FineWeb-Edu-10B-Tokens-NPY

Name: ShallowU/FineWeb-Edu-10B-Tokens-NPY
Creator: ShallowU
Published: 2025-07-15 08:24:24
License: No description available

Source: Hugging Face. Updated 2025-07-15; indexed 2025-08-30.

Download link:

https://hf-mirror.com/datasets/ShallowU/FineWeb-Edu-10B-Tokens-NPY

Description:

This is a preprocessed educational text dataset containing approximately 10 billion tokens, designed for training small language models such as GPT-2 124M. The data is sourced from the FineWeb-Edu dataset, tokenized with GPT-2's tiktoken tokenizer, and saved in NumPy format to improve training efficiency. It is suitable for small language model training, educational research, and rapid prototyping.
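As a rough illustration of how pre-tokenized NumPy data of this kind is typically consumed, the sketch below loads a shard of token IDs and slices it into input/target batches for next-token prediction. The shard file name, the 1-D array layout, and the `uint16` dtype are assumptions (the GPT-2 vocabulary of 50,257 entries fits in 16 bits); none of them is confirmed by this listing, and the helper names are hypothetical. A synthetic shard stands in for a real file so the example runs on its own.

```python
import os
import tempfile

import numpy as np

GPT2_VOCAB_SIZE = 50257  # GPT-2 BPE vocabulary size; small enough for uint16


def load_tokens(path):
    """Load one shard of token ids from a .npy file as a flat int64 array."""
    tokens = np.load(path)
    return tokens.astype(np.int64).ravel()


def get_batch(tokens, batch_size, block_size, start=0):
    """Slice a contiguous token run into (inputs, targets) shifted by one token."""
    span = batch_size * block_size
    chunk = tokens[start : start + span + 1]
    if len(chunk) < span + 1:
        raise ValueError("not enough tokens left in this shard")
    x = chunk[:-1].reshape(batch_size, block_size)
    y = chunk[1:].reshape(batch_size, block_size)
    return x, y


def make_demo_shard(num_tokens=1000, seed=0):
    """Write a synthetic shard (a stand-in for a real dataset file) and return its path."""
    rng = np.random.default_rng(seed)
    ids = rng.integers(0, GPT2_VOCAB_SIZE, size=num_tokens, dtype=np.uint16)
    path = os.path.join(tempfile.mkdtemp(), "demo_shard.npy")  # hypothetical name
    np.save(path, ids)
    return path


if __name__ == "__main__":
    shard_path = make_demo_shard()
    tokens = load_tokens(shard_path)
    x, y = get_batch(tokens, batch_size=4, block_size=8)
    print(x.shape, y.shape)
```

For shards too large to hold in memory, `np.load(path, mmap_mode="r")` is one option worth considering, since it reads slices from disk on demand.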

Provided by:

ShallowU
