wottAI/textpack-20b-tokenized

Name: wottAI/textpack-20b-tokenized
Creator: wottAI
Published: 2025-04-23 09:17:52
License: not specified

Source: Hugging Face. Updated: 2025-04-23. Indexed: 2025-11-03.

Download link:

https://hf-mirror.com/datasets/wottAI/textpack-20b-tokenized

Description:

This dataset contains preprocessed, token-packed `.bin` files for pretraining a decoder-only Transformer language model. Each `.bin` file holds a fixed number of samples, and each sample is 8192 tokens long. Samples are grouped into batches of 125, which gives 125 × 8192 = 1,024,000 tokens per batch. The tokens come from a mix of high-quality open datasets, including `C4 (en)`, `Wikipedia`, and `OpenWebText`, among others. The preprocessing and packing strategy is intended to keep the token mix balanced and to preserve continuity within samples.
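The layout described above can be read with a few lines of NumPy. The following is a minimal sketch, not the dataset's official loader. It assumes each `.bin` file is a flat, headerless array of token IDs and that the IDs are stored as `uint16`. Neither detail is stated in the description, so check the dataset card and pass a different `dtype` if needed. The helper names `load_samples` and `iter_batches` are hypothetical.

```python
import numpy as np

SEQ_LEN = 8192            # tokens per sample (from the description)
SAMPLES_PER_BATCH = 125   # samples per batch (from the description)
TOKENS_PER_BATCH = SEQ_LEN * SAMPLES_PER_BATCH  # 125 * 8192 = 1,024,000


def load_samples(path, dtype=np.uint16, seq_len=SEQ_LEN):
    """Memory-map a packed .bin file and view it as (num_samples, seq_len).

    ASSUMPTION: the file is a flat, headerless array of token IDs stored
    as `dtype`. The on-disk dtype is not given in the description.
    """
    tokens = np.memmap(path, dtype=dtype, mode="r")
    if tokens.size % seq_len != 0:
        raise ValueError(
            f"{tokens.size} tokens is not a multiple of seq_len={seq_len}; "
            "the dtype or layout assumption is probably wrong"
        )
    return tokens.reshape(-1, seq_len)


def iter_batches(samples, samples_per_batch=SAMPLES_PER_BATCH):
    """Yield consecutive (samples_per_batch, seq_len) blocks.

    A trailing partial batch, if any, is dropped.
    """
    n_full = samples.shape[0] // samples_per_batch
    for i in range(n_full):
        start = i * samples_per_batch
        yield samples[start:start + samples_per_batch]
```

Memory-mapping avoids loading a whole file into RAM, and the size check fails early if the assumed `dtype` does not match the file. A typical call would be `for batch in iter_batches(load_samples("shard.bin")): ...`, where `shard.bin` stands in for a real file name from the repository.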

Provided by:

wottAI
