RedPajama-Data-1T

Opencsg | Updated 2024-07-19 | Indexed 2025-05-03

Download link:

https://www.opencsg.com/datasets/AIWizards/RedPajama-Data-1T



Resource overview:

RedPajama-Data-1T aims to provide a fully open-source reproduction of the LLaMA training dataset and is intended primarily for text generation tasks. The dataset contains over 1 trillion tokens. It is predominantly English, but it also includes Wikipedia data in multiple languages. The data is drawn from Common Crawl, C4, GitHub, ArXiv, Wikipedia, StackExchange, and other sources, and has been preprocessed with steps such as paragraph-level deduplication, low-quality text filtering, file-level deduplication, and HTML tag removal. Users can download the dataset through Hugging Face or fetch the files directly, and SHA256 checksums are provided for verifying data integrity. Use of the dataset must comply with the license of each data subset, including those of Common Crawl, C4, GitHub, ArXiv, Wikipedia, and StackExchange.
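The integrity check mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's official tooling: it assumes the published checksums use the common `sha256sum` layout (one `<hex digest>  <relative path>` pair per line), and the helper names `sha256_of_file`, `parse_checksum_lines`, and `verify_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse '<hex digest>  <relative path>' lines into {path: digest}.

    Assumption: sha256sum-style layout; blank lines are skipped.
    """
    expected = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_files(expected, root):
    """Return the relative paths that are missing or whose digest differs."""
    failures = []
    for name, digest in expected.items():
        target = Path(root) / name
        if not target.is_file() or sha256_of_file(target) != digest:
            failures.append(name)
    return failures
```

Reading in chunks keeps memory use flat, which matters for shards that are many gigabytes each. An empty list from `verify_files` means every listed file was found and matched its expected digest.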

Created:

2024-07-19

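Collected summary

Dataset introduction

Background and challenges

Background overview

RedPajama-Data-1T is a fully open-source text generation dataset containing over 1 trillion tokens. It is drawn mainly from Common Crawl, C4, and several other public data sources, and has been rigorously preprocessed. The dataset supports multiple download methods, and its use must comply with the license of each subset.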

The content above was collected and summarized by 遇见数据集.
