wottAI/textpack-20b-tokenized
Source: Hugging Face. Updated 2025-04-23. Indexed 2025-11-03.
Download link:
https://hf-mirror.com/datasets/wottAI/textpack-20b-tokenized
Description:
This dataset contains preprocessed, token-packed `.bin` files intended for pretraining a decoder-only Transformer language model. Each `.bin` file holds a fixed number of samples, and each sample is 8192 tokens long. Samples are grouped into batches of 125, so one batch contains 125 × 8192 = 1,024,000 tokens. The tokens come from a mix of high-quality open datasets such as `C4 (en)`, `Wikipedia`, `OpenWebText`, and others. The data was preprocessed and packed with specific strategies intended to keep the token mix balanced and the samples contiguous.
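The layout described above (fixed-length samples of 8192 tokens, grouped 125 per batch) could be read with something like the following minimal sketch. The description does not state the on-disk token dtype or whether the files carry a header, so the `uint16` default and the headerless flat layout are assumptions, and `load_samples` and `iter_batches` are hypothetical helper names, not part of the dataset's tooling.

```python
import numpy as np

SEQ_LEN = 8192            # tokens per sample (from the description)
SAMPLES_PER_BATCH = 125   # samples per batch (from the description)
TOKENS_PER_BATCH = SEQ_LEN * SAMPLES_PER_BATCH  # 1,024,000


def load_samples(path, dtype=np.uint16):
    """Memory-map a packed .bin file as a (num_samples, SEQ_LEN) array.

    ASSUMPTION: the file is a flat, headerless array of token ids stored
    as `dtype`. Check the dataset card for the real dtype and layout.
    """
    tokens = np.memmap(path, dtype=dtype, mode="r")
    if tokens.size % SEQ_LEN != 0:
        raise ValueError(
            f"{path}: {tokens.size} tokens is not a multiple of {SEQ_LEN}"
        )
    return tokens.reshape(-1, SEQ_LEN)


def iter_batches(samples):
    """Yield consecutive groups of SAMPLES_PER_BATCH samples."""
    for start in range(0, samples.shape[0], SAMPLES_PER_BATCH):
        yield samples[start:start + SAMPLES_PER_BATCH]
```

Memory-mapping avoids loading a whole shard into RAM; each yielded batch is a view of shape `(125, 8192)`, except possibly the last one if a file's sample count is not a multiple of 125.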
Provider:
wottAI



