kanahia1/RedPajama-2B-Sample

Name: kanahia1/RedPajama-2B-Sample
Creator: kanahia1
Published: 2026-02-08 11:25:45
License: No description available

Source: Hugging Face. Updated 2026-02-08. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/kanahia1/RedPajama-2B-Sample


Description:

## Dataset Summary

This dataset is a **2 billion token subset** of the [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset created by Together Computer. It was created to provide a smaller, lightweight version of the massive 1.2 trillion token dataset for testing, debugging, and small-scale pre-training experiments.

This specific subset was streamed primarily from the **C4 (Colossal Clean Crawled Corpus)** partition of the original RedPajama dataset.

## Dataset Structure

The dataset retains the original structure of RedPajama. Each row contains:

- **text**: The actual content (string).
- **meta**: Metadata dict containing `url`, `timestamp`, `source`, etc.
- **subset**: The name of the RedPajama subset this sample came from (e.g., `c4`).

## Source Data

This dataset is a direct downstream derivative of **RedPajama-Data-1T**.

- **Original Repository:** [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- **Original Author:** Together Computer
- **Subset Used:** `c4` (Colossal Clean Crawled Corpus)

## Creation Process

This dataset was generated using the Hugging Face `datasets` library with the following methodology:

1. **Direct Streaming:** Data was streamed directly from the original source files to avoid downloading the full 5TB dataset.
2. **Selection:** The first ~2 billion tokens were selected from the `c4` split.
3. **Token Counting:** Token counts were estimated using the `EleutherAI/gpt-neox-20b` tokenizer (or a 1 token ≈ 4 characters approximation).

## Licensing Information

Since this is a subset of RedPajama, it inherits the licensing terms of the original data.

- **RedPajama-Data-1T** is distributed under the licenses of its underlying data sources.
- **C4 (Common Crawl)**: Terms of Use are available [here](https://commoncrawl.org/terms-of-use/).
- **Code/Scripts**: The code used to generate this subset is open source.

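The selection and token-counting steps listed under Creation Process can be illustrated with a small sketch. It is not the author's script, which the card does not include. The function names `estimate_tokens` and `take_token_budget` are hypothetical, and the sketch uses the card's "1 token ≈ 4 characters" approximation rather than the `EleutherAI/gpt-neox-20b` tokenizer. It assumes each row is a dict with a `text` field, as described under Dataset Structure.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumption: the card's fallback rule of thumb, 1 token ≈ 4 characters.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (not a real tokenizer)."""
    return len(text) // CHARS_PER_TOKEN


def take_token_budget(
    rows: Iterable[Dict[str, Any]], budget: int
) -> Iterator[Dict[str, Any]]:
    """Yield rows in their original order until the running
    estimated token total reaches `budget`.

    The input is consumed lazily, so it can be a streamed source
    that is never fully downloaded.
    """
    total = 0
    for row in rows:
        if total >= budget:
            break
        total += estimate_tokens(row["text"])
        yield row


if __name__ == "__main__":
    # Tiny in-memory stand-in for a streamed split; each row is about 10 tokens.
    sample = [
        {"text": "a" * 40, "meta": {"source": "c4"}, "subset": "c4"}
        for _ in range(5)
    ]
    kept = list(take_token_budget(sample, budget=25))
    print(len(kept), "rows kept")
```

Because the last row is yielded before the budget check runs again, the selected total can overshoot the budget by up to one document, which is consistent with the card's "~2 billion" wording. With the `datasets` library, `rows` could be a streaming iterable of the `c4` subset; the exact arguments the author used are not stated in the card.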
## Citation

If you use this dataset, please cite the original RedPajama paper:

```bibtex
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```


Provider:

kanahia1
