nikolina-p/mini_gutenberg_flat
Source: Hugging Face. Updated: 2025-10-20. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/nikolina-p/mini_gutenberg_flat
Description:
The Mini Project Gutenberg dataset is a cleaned, tokenized English subset made up of the first 24 books of the nikolina-p/gutenberg_flat dataset. It was created for learning, testing streaming datasets, distributed data parallel (DDP) training, and quick experimentation. The dataset's structure is adapted for training autoregressive models in a distributed environment: each split contains 8 shards, all shards within a split hold the same number of tokens, and each row consists of 16×1,024 + 1 tokens.
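The row length of 16×1,024 + 1 = 16,385 tokens suggests the usual next-token layout, where the one extra token lets inputs and targets be offset by one position. The sketch below illustrates that reading only. It assumes a row is a flat list of token ids, that the sequence length is 1,024, and that a row holds 16 sequences; the names `SEQ_LEN`, `SEQS_PER_ROW`, and `split_row` are hypothetical and do not come from the dataset card.

```python
# Assumed layout: 16 sequences of 1,024 tokens plus one extra token per row.
SEQ_LEN = 1024
SEQS_PER_ROW = 16
ROW_LEN = SEQS_PER_ROW * SEQ_LEN + 1  # 16,385 tokens


def split_row(row):
    """Split one flat row of token ids into (inputs, targets).

    Both results are lists of SEQS_PER_ROW sequences of SEQ_LEN tokens,
    with targets shifted one position to the right of inputs.
    """
    if len(row) != ROW_LEN:
        raise ValueError(f"expected {ROW_LEN} tokens, got {len(row)}")
    inputs_flat = row[:-1]   # drop the last token
    targets_flat = row[1:]   # drop the first token
    inputs = [inputs_flat[i:i + SEQ_LEN]
              for i in range(0, len(inputs_flat), SEQ_LEN)]
    targets = [targets_flat[i:i + SEQ_LEN]
               for i in range(0, len(targets_flat), SEQ_LEN)]
    return inputs, targets


# Stand-in token ids, used here in place of a real dataset row.
row = list(range(ROW_LEN))
inputs, targets = split_row(row)
```

Under these assumptions every input position has a target, including the last position of the sixteenth sequence, which is why the row carries one token more than 16×1,024.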
Provided by:
nikolina-p



