RedPajama-Data-1T

Name: RedPajama-Data-1T
Creator: maas
Published: 2026-05-07 09:47:27
License: No description provided

ModelScope community; updated 2026-05-07, indexed 2024-06-08

Download link:

https://modelscope.cn/datasets/togethercomputer/RedPajama-Data-1T

Resource description:

### Getting Started

The dataset consists of 2084 jsonl files. You can download the dataset using Hugging Face:

```python
from datasets import load_dataset

ds = load_dataset("togethercomputer/RedPajama-Data-1T")
```

Or you can download the files directly using the following commands:

```
wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'
# Recreate the remote directory layout locally, one URL per line.
while read -r line; do
    dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
    mkdir -p "$(dirname "$dload_loc")"
    wget "$line" -O "$dload_loc"
done < urls.txt
```

After downloading the files, you can load the dataset from disk by setting the `RED_PAJAMA_DATA_DIR` environment variable to the directory containing the files:

```python
import os
from datasets import load_dataset

os.environ["RED_PAJAMA_DATA_DIR"] = "/path/to/download"
ds = load_dataset("togethercomputer/RedPajama-Data-1T")
```

A smaller 1B-token sample of the dataset can be found [here](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample).

A full set of scripts to recreate the dataset from scratch can be found [here](https://github.com/togethercomputer/RedPajama-Data).

### Dataset Summary

RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |

### Languages

Primarily English, though the Wikipedia slice contains multiple languages.

## Dataset Structure

The dataset structure is as follows:

```json
{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "arxiv" | "wikipedia" | "stackexchange"
}
```

## Dataset Creation

This dataset was created to follow the LLaMA paper as closely as possible and to try to reproduce its recipe.
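As a rough illustration of the record layout shown under Dataset Structure, the sketch below parses JSON Lines text and counts records per subset. It is a minimal sketch, not the project's own loader: the helper names `iter_records` and `count_by_subset` are hypothetical, the sample lines are made up, and it assumes each line is a JSON object with at least `text` and `meta`. The `red_pajama_subset` field may be attached by the loading script rather than stored in the raw files, so the code falls back to `"unknown"` when it is absent.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Keys this sketch assumes every record carries (taken from the structure above).
REQUIRED_KEYS = ("text", "meta")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON line (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        yield record


def count_by_subset(records: Iterable[dict]) -> Counter:
    """Count records per `red_pajama_subset`, using "unknown" when the field is absent."""
    return Counter(record.get("red_pajama_subset", "unknown") for record in records)


# Made-up sample lines for illustration only; not real dataset content.
sample_lines = [
    '{"text": "hello", "meta": {"source": "example"}, "red_pajama_subset": "c4"}',
    '{"text": "world", "meta": {"source": "example"}, "red_pajama_subset": "arxiv"}',
    '{"text": "again", "meta": {"source": "example"}, "red_pajama_subset": "c4"}',
]
counts = count_by_subset(iter_records(sample_lines))
print(counts)
```

Because `iter_records` accepts any iterable of strings, an open file handle for one of the downloaded jsonl files can be passed in place of `sample_lines`, which keeps memory use flat for large files.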
### Source Data

#### Commoncrawl

We download five dumps from Commoncrawl and run them through the official `cc_net` pipeline. We then deduplicate at the paragraph level and filter out low-quality text using a linear classifier trained to classify paragraphs as Wikipedia references or random Commoncrawl samples.

#### C4

C4 is downloaded from Hugging Face. The only preprocessing step is to bring the data into our own format.

#### GitHub

The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.

#### Wikipedia

We use the Wikipedia dataset available on Hugging Face, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes in a preprocessed format, so hyperlinks, comments, and other formatting boilerplate have been removed.

#### Gutenberg and Books3

<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
<p><b>Defunct:</b> The 'book' config is defunct and no longer accessible due to reported copyright infringement for the Books3 dataset contained in this config.</p>
</div>

#### ArXiv

ArXiv data is downloaded from Amazon S3 in the `arxiv` requester pays bucket. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.

#### Stackexchange

The Stack Exchange split of the dataset is downloaded from the [Internet Archive](https://archive.org/download/stackexchange). Here we keep only the posts from the 28 largest sites, remove HTML tags, group the posts into question-answer pairs, and order answers by their score.
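The grouping step described for the Stack Exchange split (pair each question with its answers, then order the answers by score) could look roughly like the sketch below. This is an illustrative sketch only, not the RedPajama preprocessing code: the function name `group_question_answers` and the field names `id`, `kind`, `parent_id`, `score`, and `body` are hypothetical stand-ins and do not reflect the schema of the Internet Archive dump. It also assumes a higher score ranks first and that answers without a matching question are dropped.

```python
from collections import defaultdict
from typing import Iterable


def group_question_answers(posts: Iterable[dict]) -> list[dict]:
    """Pair each question with its answers, best-scored answer first (hypothetical helper)."""
    questions = {}
    answers = defaultdict(list)
    for post in posts:
        if post["kind"] == "question":
            questions[post["id"]] = post
        elif post["kind"] == "answer":
            answers[post["parent_id"]].append(post)

    pairs = []
    for question_id, question in questions.items():
        # sorted() is stable, so answers with equal scores keep their input order.
        ranked = sorted(
            answers.get(question_id, []),
            key=lambda answer: answer["score"],
            reverse=True,
        )
        pairs.append(
            {
                "question": question["body"],
                "answers": [answer["body"] for answer in ranked],
            }
        )
    return pairs


# Made-up posts for illustration only.
sample_posts = [
    {"id": 1, "kind": "question", "body": "How do I do X?"},
    {"id": 2, "kind": "answer", "parent_id": 1, "score": 3, "body": "Try A."},
    {"id": 3, "kind": "answer", "parent_id": 1, "score": 7, "body": "Try B."},
    {"id": 4, "kind": "answer", "parent_id": 99, "score": 5, "body": "Orphaned."},
]
pairs = group_question_answers(sample_posts)
print(pairs)
```

Collecting answers in a dictionary keyed by the parent question keeps the grouping to a single pass over the posts, with one sort per question afterwards.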
### SHA256 Checksums

SHA256 checksums for the dataset files for each data source are available here:

```
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/arxiv_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/c4_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/common_crawl_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/github_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/stackexchange_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/wikipedia_SHA256SUMS.txt
```

To cite RedPajama, please use:

```
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

### License

Please refer to the licenses of the data subsets you use.

* [Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use/full/)
* [C4 license](https://huggingface.co/datasets/allenai/c4#license)
* GitHub was limited to MIT, BSD, or Apache licenses only
* [ArXiv Terms of Use](https://info.arxiv.org/help/api/tou.html)
* [Wikipedia License](https://huggingface.co/datasets/wikipedia#licensing-information)
* [StackExchange license on the Internet Archive](https://archive.org/details/stackexchange)
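The SHA256 checksum files listed earlier can be used to verify downloaded files. The sketch below is one possible way to do that with the Python standard library. It assumes, without confirmation from the source, that each `*_SHA256SUMS.txt` file follows the usual `sha256sum` layout of a hex digest followed by a relative file path on each line. The helper names `sha256_of_file`, `parse_checksum_lines`, and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines) -> dict:
    """Map file name to expected digest, assuming '<hex digest>  <path>' per line."""
    expected = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*'; drop it if present.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def find_mismatches(expected: dict, root: Path) -> list:
    """Return names that are missing under `root` or whose digest differs."""
    mismatches = []
    for name, digest in expected.items():
        target = root / name
        if not target.is_file() or sha256_of_file(target) != digest:
            mismatches.append(name)
    return mismatches
```

A typical use would be to read one checksum file with `parse_checksum_lines(open(path, encoding="utf-8"))` and call `find_mismatches` with the download directory; an empty result means every listed file was found and matched. If the checksum files turn out to use a different layout, `parse_checksum_lines` is the only part that would need to change.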


Provider:

maas

Created:

2025-11-18
