RedPajama-V2
Source: OpenXLab (indexed 2026-04-18)
Download link:
https://openxlab.org.cn/datasets/OpenDataLab/RedPajama-V2
Description:
RedPajama-V2 is an open dataset for training large language models. It contains over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals. The IDs of duplicated documents are also provided and can be used to create a deduplicated dataset of 20B documents.
Provider:
OpenDataLab
Created:
2024-05-14
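
Example:
The description notes that the IDs of duplicated documents can be used to derive the deduplicated subset. Below is a minimal sketch of that filtering step in Python. It assumes the documents are available as dictionaries and the duplicate IDs as a set. The field name `doc_id`, the helper `drop_duplicates`, and the toy records are hypothetical and do not reflect the dataset's actual file layout.

```python
from typing import Dict, Iterable, Iterator, Set


def drop_duplicates(
    documents: Iterable[Dict[str, str]],
    duplicate_ids: Set[str],
    id_key: str = "doc_id",  # hypothetical field name, not the dataset's schema
) -> Iterator[Dict[str, str]]:
    """Yield only the documents whose ID is not listed as a duplicate."""
    for doc in documents:
        if doc[id_key] not in duplicate_ids:
            yield doc


# Toy records standing in for corpus documents (made-up IDs and text).
sample_documents = [
    {"doc_id": "a", "text": "first page"},
    {"doc_id": "b", "text": "copy of first page"},
    {"doc_id": "c", "text": "second page"},
]
sample_duplicate_ids = {"b"}

deduplicated = list(drop_duplicates(sample_documents, sample_duplicate_ids))
print([doc["doc_id"] for doc in deduplicated])  # prints ['a', 'c']
```

Because the function is a generator, documents can be streamed through it one at a time. At the scale of this corpus the duplicate ID set itself may not fit in memory, in which case it would need to be processed per shard or with a more compact membership structure.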



