nikolina-p/fineweb_10BT_tokenized
Hugging Face | Updated: 2025-10-31 | Indexed: 2025-11-15
Download link:
https://hf-mirror.com/datasets/nikolina-p/fineweb_10BT_tokenized
Description:
The FineWeb-Edu 10B tokenized dataset contains tokenized text from the FineWeb-Edu sample-10B. The text was tokenized with OpenAI's tiktoken tokenizer and structured for efficient streaming and distributed (DDP) training. The dataset follows Hugging Face's recommended layout for streaming in multi-GPU environments and has two splits, train and validation, each divided into a set number of shards so that they can be distributed efficiently across GPU nodes.
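As a rough illustration of the streaming and DDP use described above, the sketch below opens one split in streaming mode and hands each process its own portion. It assumes the `datasets` package is installed and that the job is launched with a tool such as `torchrun`, which sets the `RANK` and `WORLD_SIZE` environment variables. The helper names `rank_and_world_size` and `stream_split` are hypothetical and not part of the dataset. The column names of the examples are not stated in this listing, so the sketch does not rely on them.

```python
import os


def rank_and_world_size(env=None):
    """Read RANK and WORLD_SIZE (as set by torchrun); default to one process."""
    env = os.environ if env is None else env
    return int(env.get("RANK", 0)), int(env.get("WORLD_SIZE", 1))


def stream_split(split="train", env=None):
    """Return a streaming view of `split` restricted to this process's portion.

    Hypothetical helper: the split names "train" and "validation" come from
    the dataset description; everything else here is an assumption.
    """
    # Imported lazily so rank_and_world_size() works without `datasets`.
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node

    rank, world_size = rank_and_world_size(env)
    # streaming=True reads shards on demand instead of downloading everything.
    ds = load_dataset(
        "nikolina-p/fineweb_10BT_tokenized", split=split, streaming=True
    )
    # Restrict the stream to the data this rank should consume.
    return split_dataset_by_node(ds, rank=rank, world_size=world_size)
```

In a training script, each process could call `stream_split("train")` and inspect `next(iter(...)).keys()` once to find the actual field names before building batches.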
Provider:
nikolina-p



