BroAlanTaps/Pretrain-Stage1-512
Source: Hugging Face. Updated: 2025-03-12. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/BroAlanTaps/Pretrain-Stage1-512
Description:
The dataset contains several fields, including text, target, compressed ID sequence, LLM ID sequence, next ID sequence, and token count. It is split into a training set with more than 6 million examples and a test set with 608 examples. The total size exceeds 10 billion bytes (over 10 GB).
Provider:
BroAlanTaps



