kanahia1/RedPajama-2B-Sample
Source: Hugging Face · Updated 2026-02-08 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kanahia1/RedPajama-2B-Sample
## Dataset Summary
This dataset is a **2-billion-token subset** of the [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset created by Together Computer.
It provides a lightweight version of the full 1.2-trillion-token dataset for testing, debugging, and small-scale pre-training experiments. The subset was streamed primarily from the **C4 (Colossal Clean Crawled Corpus)** partition of the original RedPajama dataset.
## Dataset Structure
The dataset retains the original structure of RedPajama. Each row contains:
- **text**: The actual content (string).
- **meta**: Metadata dict containing `url`, `timestamp`, `source`, etc.
- **subset**: The name of the RedPajama subset this sample came from (e.g., `c4`).
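To make the row layout concrete, here is a minimal sketch that checks a row for the three fields listed above and pulls out a few values. The helper name `summarize_row`, the example row, and its values are hypothetical and for illustration only; the sketch also assumes `meta` may arrive either as a dict or as a JSON-encoded string, which is worth verifying against the actual data.

```python
import json
from typing import Any, Dict

# Field names as listed in this dataset card.
EXPECTED_FIELDS = ("text", "meta", "subset")


def summarize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Validate one row and return a small summary (hypothetical helper)."""
    missing = [name for name in EXPECTED_FIELDS if name not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")

    meta = row["meta"]
    # Assumption: meta may be stored as a JSON string instead of a dict.
    if isinstance(meta, str):
        meta = json.loads(meta)

    return {
        "subset": row["subset"],
        "url": meta.get("url"),
        "n_chars": len(row["text"]),
    }


# Made-up row that mirrors the documented structure.
example_row = {
    "text": "Hello, world.",
    "meta": {
        "url": "https://example.com/page",
        "timestamp": "2019-04-25T12:57:54Z",
        "source": "c4",
    },
    "subset": "c4",
}

print(summarize_row(example_row))
```

The same helper can be applied to rows obtained from the Hugging Face `datasets` library once the dataset is loaded.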
## Source Data
This dataset is a direct downstream derivative of **RedPajama-Data-1T**.
- **Original Repository:** [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- **Original Author:** Together Computer
- **Subset Used:** `c4` (Colossal Clean Crawled Corpus)
## Creation Process
This dataset was generated using the Hugging Face `datasets` library with the following methodology:
1. **Direct Streaming:** Data was streamed directly from the original source files to avoid downloading the full 5 TB dataset.
2. **Selection:** The first ~2 billion tokens were selected from the `c4` subset.
3. **Token Counting:** Token counts were estimated using the `EleutherAI/gpt-neox-20b` tokenizer (or a 1 token ≈ 4 characters approximation).
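The steps above can be sketched roughly as follows. This is not the script used to build the dataset: the function names `approx_token_count` and `take_token_budget` are hypothetical, and the counter implements only the 1 token ≈ 4 characters approximation. A real tokenizer could be passed in as `count_tokens` instead.

```python
from typing import Callable, Dict, Iterable, Iterator


def approx_token_count(text: str) -> int:
    """Estimate tokens with the 1 token ~= 4 characters rule of thumb."""
    return len(text) // 4


def take_token_budget(
    rows: Iterable[Dict[str, str]],
    budget: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> Iterator[Dict[str, str]]:
    """Yield rows in order until the running token estimate reaches `budget`.

    The row that crosses the budget is still yielded, so the total is
    approximately (not exactly) `budget` tokens.
    """
    used = 0
    for row in rows:
        if used >= budget:
            break
        used += count_tokens(row["text"])
        yield row


# Tiny in-memory stand-in for a streamed dataset: 5 rows of ~10 tokens each.
fake_stream = [{"text": "a" * 40} for _ in range(5)]
selected = list(take_token_budget(fake_stream, budget=25))
print(len(selected))
```

In practice `rows` would be a streamed iterable, for example one obtained from `datasets.load_dataset(..., streaming=True)`, so that only the selected prefix is ever read from the source files.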
## Licensing Information
Since this is a subset of RedPajama, it inherits the licensing terms of the original data.
- **RedPajama-Data-1T** is distributed under the licenses of its underlying data sources.
- **C4 (Common Crawl)**: Terms of Use are available [here](https://commoncrawl.org/terms-of-use/).
- **Code/Scripts**: The code used to generate this subset is open source.
## Citation
If you use this dataset, please cite the original RedPajama paper:
```bibtex
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = {April},
  year = {2023},
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
Provider:
kanahia1



