RedPajama-1B
Source: arXiv, indexed 2025-09-30
Download link:
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
Resource description:
RedPajama-1B contains data samples from seven domains: arXiv, Wikipedia, books, Common Crawl, C4, StackExchange, and GitHub. It is a large-scale dataset for language model pre-training, and it is used to evaluate the effectiveness of irreducible curriculum learning algorithms in language model training.
Provider:
Together Computer



