gabrielagc/chunked-fineweb-edu
Source: Hugging Face. Updated: 2024-07-26. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/gabrielagc/chunked-fineweb-edu
Description:
This dataset is derived from FineWeb-Edu. Samples from FineWeb-Edu are split into chunks of 512 tokens, and the beginning of each sequence (its first 20 tokens) is prepended to the subsequent chunks. Only samples with a token count greater than 80% of 512 (that is, at least 410 tokens) are included in the final dataset.
The dataset has four features: text and id, both of string type, and token_count and preceding_token_count, both of integer type. It provides a single training split containing 2,658,953 samples, with a total size of approximately 5.86 GB (5,862,530,740 bytes).
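The chunking scheme described above can be illustrated with a minimal sketch. This is not the dataset author's code: the function name `chunk_tokens`, its parameters, and the output keys are hypothetical, and two details the description leaves open are assumed here, namely that the 20-token prefix counts toward the 512-token budget of each later chunk, and that the 80% filter is applied to each resulting chunk.

```python
from typing import Dict, List


def chunk_tokens(
    tokens: List[int],
    chunk_size: int = 512,
    prefix_len: int = 20,
    min_fraction: float = 0.8,
) -> List[Dict]:
    """Split a token sequence into chunks, carrying the sequence start forward.

    Assumptions (not confirmed by the dataset description):
    - the prefix counts toward the chunk_size budget of later chunks;
    - the length filter is applied per chunk, after chunking.
    """
    if not 0 <= prefix_len < chunk_size:
        raise ValueError("prefix_len must be in [0, chunk_size)")

    prefix = tokens[:prefix_len]          # first tokens of the sequence
    min_tokens = min_fraction * chunk_size  # 409.6 for the defaults
    chunks: List[Dict] = []
    start = 0
    is_first = True

    while start < len(tokens):
        if is_first:
            # The first chunk already starts with the sequence beginning.
            body = tokens[start:start + chunk_size]
            piece = body
            preceding = 0
        else:
            # Later chunks get the sequence beginning prepended.
            body = tokens[start:start + chunk_size - len(prefix)]
            piece = prefix + body
            preceding = len(prefix)
        start += len(body)
        is_first = False

        # Keep only chunks longer than 80% of the chunk size.
        if len(piece) > min_tokens:
            chunks.append(
                {
                    "tokens": piece,
                    "token_count": len(piece),
                    "preceding_token_count": preceding,
                }
            )
    return chunks
```

Under these assumptions, a 1,200-token sequence yields two full 512-token chunks, and the short remainder is dropped by the length filter. In practice the token IDs would come from a tokenizer, and the `text` feature would be obtained by decoding each chunk.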
Provider:
gabrielagc



