juliannunezb/mixed-pretrain-10b-gpt2

Name: juliannunezb/mixed-pretrain-10b-gpt2
Creator: juliannunezb
Published: 2026-04-22 00:10:44
License: No description available

Hugging Face; updated 2026-04-22; indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/juliannunezb/mixed-pretrain-10b-gpt2

Description:

A 10-billion-token pretraining dataset, tokenized with GPT-2 BPE and assembled as a diverse mix of web text, books, Wikipedia, code, academic papers, Q&A, and instruction-formatted conversations. It was built to train a roughly 500M-parameter GPT-2-style transformer from scratch. The dataset is organized into 100 shards of 100 million tokens each, for a total of 10 billion tokens. Sources include fineweb, fineweb_edu, openwebtext, pg19, wikipedia, github, stackexchange, and instruct; the README gives the token count and mixing proportion for each source. The dataset is intended for research and for pretraining open language models.
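The shard layout described above can be checked with a little arithmetic. The sketch below is illustrative only: the shard count and shard size come from the description, while the helper names and the 2-bytes-per-token storage figure are assumptions (GPT-2 token IDs fit in an unsigned 16-bit integer because the vocabulary has 50,257 entries, but the actual on-disk format of the shards is not stated here).

```python
# Shard arithmetic for the layout given in the description.
TOKENS_PER_SHARD = 100_000_000  # from the description: 100 million tokens per shard
NUM_SHARDS = 100                # from the description: 100 shards
GPT2_VOCAB_SIZE = 50_257        # size of the GPT-2 BPE vocabulary


def total_tokens() -> int:
    """Total tokens across all shards."""
    return TOKENS_PER_SHARD * NUM_SHARDS


def locate_token(global_index: int) -> tuple[int, int]:
    """Map a global token index to (shard_index, offset_within_shard).

    Hypothetical helper: assumes shards are numbered from 0 and filled in order.
    """
    if not 0 <= global_index < total_tokens():
        raise IndexError(f"token index {global_index} is out of range")
    return divmod(global_index, TOKENS_PER_SHARD)


def shard_size_bytes(bytes_per_token: int = 2) -> int:
    """Approximate raw size of one shard.

    Assumption: token IDs are stored as fixed-width integers; 2 bytes (uint16)
    is enough for GPT-2 because GPT2_VOCAB_SIZE <= 2**16.
    """
    return TOKENS_PER_SHARD * bytes_per_token


if __name__ == "__main__":
    print("total tokens:", total_tokens())
    print("token 250,000,123 lives at:", locate_token(250_000_123))
    print("bytes per shard at uint16:", shard_size_bytes())
```

Under the uint16 assumption, each shard would hold about 200 MB of raw token IDs and the full dataset about 20 GB; consult the README for the real file format before relying on these figures.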

Provided by:

juliannunezb
