xfxcwynlc/fineweb100BT-hymba-tokenized
Source: Hugging Face | Updated: 2025-06-17 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/xfxcwynlc/fineweb100BT-hymba-tokenized
Description:
This dataset is a training set containing more than 57 million samples. Each sample has three sequence fields: input_ids (int32), labels (int64), and attention_mask (int8). The entire dataset is approximately 1.5PB in size, with a download size of about 443GB.
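As a rough illustration of the record layout described above, the following sketch models one sample and estimates its uncompressed footprint from the stated dtypes alone. The helper names, the sequence length of 2048, and the choice to copy input_ids into labels are assumptions made for the example; the listing does not state how the samples are actually constructed.

```python
import struct

# Field dtypes as stated in the description. The "<" prefix selects
# struct's standard sizes, so the widths are platform independent.
FIELD_FORMATS = {
    "input_ids": "<i",       # int32, 4 bytes per element
    "labels": "<q",          # int64, 8 bytes per element
    "attention_mask": "<b",  # int8, 1 byte per element
}


def bytes_per_token() -> int:
    """Uncompressed bytes stored per token position across all three fields."""
    return sum(struct.calcsize(fmt) for fmt in FIELD_FORMATS.values())


def sample_nbytes(seq_len: int) -> int:
    """Uncompressed size of one sample whose three fields share one length."""
    return seq_len * bytes_per_token()


def make_toy_sample(token_ids):
    """Build a hypothetical sample with the three listed fields.

    Assumption: labels are a plain copy of input_ids and every position
    is attended to. The real dataset may differ.
    """
    ids = list(token_ids)
    return {
        "input_ids": ids,
        "labels": list(ids),
        "attention_mask": [1] * len(ids),
    }


if __name__ == "__main__":
    sample = make_toy_sample([101, 2023, 2003, 102])
    print(sorted(sample))
    print(bytes_per_token())
    print(sample_nbytes(2048))  # 2048 is an assumed sequence length
```

Because labels are stored as int64, they account for 8 of the 13 bytes per token position under this layout.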
Provider:
xfxcwynlc



