zhihanyang/SlimPajama-627B_Reupload
Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/zhihanyang/SlimPajama-627B_Reupload
Description:
SlimPajama-627B is a large-scale text dataset of approximately 627B tokens, intended for training and research on large language models. Each record has two fields: text, which holds the document content, and meta, which contains redpajama_set_name, the name of the RedPajama subset the document came from. The dataset is split into train, validation, and test sets with 590,394,625, 502,556, and 502,268 examples respectively. The original dataset, provided by Cerebras, consists of a very large number of small files and is therefore difficult to download; this version re-uploads the same data in larger chunks so that it can be downloaded and processed more efficiently.
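The record layout described above can be illustrated with a short Python sketch. It assumes the Hugging Face `datasets` library is installed and that this repository loads with its default configuration and the split names listed above; neither is confirmed by this listing. The helper names `count_by_subset` and `stream_split`, and the subset names in the sample records, are hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Any, Dict, Iterable, Optional

# Repository id taken from the download link on this page.
REPO_ID = "zhihanyang/SlimPajama-627B_Reupload"


def count_by_subset(
    records: Iterable[Dict[str, Any]], limit: Optional[int] = None
) -> Counter:
    """Tally records by meta.redpajama_set_name (hypothetical helper).

    `records` is any iterable of dicts shaped like
    {"text": str, "meta": {"redpajama_set_name": str}}.
    `limit` caps how many records are read; None reads them all.
    """
    counts: Counter = Counter()
    for record in islice(records, limit):
        counts[record["meta"]["redpajama_set_name"]] += 1
    return counts


def stream_split(split: str = "validation"):
    """Stream one split without downloading it in full (hypothetical helper).

    Assumes the repository loads with its default configuration.
    The import is local so the rest of the sketch runs without `datasets`.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


if __name__ == "__main__":
    # Offline demonstration on hand-written records; the subset names
    # below are placeholders, not values read from the dataset.
    sample = [
        {"text": "first document", "meta": {"redpajama_set_name": "SubsetA"}},
        {"text": "second document", "meta": {"redpajama_set_name": "SubsetB"}},
        {"text": "third document", "meta": {"redpajama_set_name": "SubsetA"}},
    ]
    print(count_by_subset(sample))
```

To inspect real data, something like `count_by_subset(stream_split("validation"), limit=1000)` would tally the first 1,000 validation records by source subset while reading only as much of the split as needed.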
Provider:
zhihanyang



