ByteSpanTokenisers/finewebedu-20B
Source: Hugging Face. Updated: 2025-06-23. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/ByteSpanTokenisers/finewebedu-20B
Description:
FineWebEDU 20B is a dataset for tokenizer experiments. It includes the full dataset tokenized at the byte level, several subsets for training, prediction extraction, and evaluation, and versions of the dataset tokenized with different trained tokenizers.
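For readers unfamiliar with the term, byte-level tokenization can be sketched as below. This is a minimal illustration that assumes one token id per UTF-8 byte (ids 0 to 255). The listing does not state the exact scheme used for this dataset, and the function names `bytelevel_encode` and `bytelevel_decode` are hypothetical, not part of the dataset or any library.

```python
# Minimal sketch of byte-level tokenization.
# Assumption: each UTF-8 byte maps directly to one token id in 0..255.
# The dataset's actual encoding may differ; names here are illustrative only.


def bytelevel_encode(text: str) -> list[int]:
    """Map text to token ids, one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def bytelevel_decode(ids: list[int]) -> str:
    """Invert bytelevel_encode by reassembling the bytes into text."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    sample = "FineWeb-Edu"
    ids = bytelevel_encode(sample)
    print(ids)
    # The round trip should reproduce the input exactly.
    print(bytelevel_decode(ids) == sample)
```

Under this assumption the vocabulary is fixed at 256 entries and no text is ever out of vocabulary, which is one reason a byte-level version is a convenient baseline for comparing trained tokenizers.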
Provider:
ByteSpanTokenisers



