
kanahia1/RedPajama-2B-Sample

Source: Hugging Face. Updated 2026-02-08. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kanahia1/RedPajama-2B-Sample
Description:
## Dataset Summary

This dataset is a **2 Billion token subset** of the [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset created by Together Computer. It was created to provide a smaller, lightweight version of the massive 1.2 Trillion token dataset for testing, debugging, and small-scale pre-training experiments.

This specific subset was streamed primarily from the **C4 (Colossal Clean Crawled Corpus)** partition of the original RedPajama dataset.

## Dataset Structure

The dataset retains the original structure of RedPajama. Each row contains:

- **text**: The actual content (string).
- **meta**: Metadata dict containing `url`, `timestamp`, `source`, etc.
- **subset**: The name of the RedPajama subset this sample came from (e.g., `c4`).

## Source Data

This dataset is a direct downstream derivative of **RedPajama-Data-1T**.

- **Original Repository:** [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- **Original Author:** Together Computer
- **Subset Used:** `c4` (Colossal Clean Crawled Corpus)

## Creation Process

This dataset was generated using the Hugging Face `datasets` library with the following methodology:

1. **Direct Streaming:** Data was streamed directly from the original source files to avoid downloading the full 5TB dataset.
2. **Selection:** The first ~2 Billion tokens were selected from the `c4` split.
3. **Token Counting:** Token counts were estimated using the `EleutherAI/gpt-neox-20b` tokenizer (or a 1 token ≈ 4 characters approximation).

## Licensing Information

Since this is a subset of RedPajama, it inherits the licensing terms of the original data.

- **RedPajama-Data-1T** is distributed under the licenses of its underlying data sources.
- **C4 (Common Crawl)**: Terms of Use are available [here](https://commoncrawl.org/terms-of-use/).
- **Code/Scripts**: The code used to generate this subset is Open Source.
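## Creation Process Sketch

The selection step described under Creation Process (stream rows in order and stop once roughly 2 billion tokens have been collected) can be illustrated with a minimal sketch. This is not the script used to build the dataset, which is not shown in this card. The function names `approx_token_count` and `take_until_token_budget`, the sample rows, and the tiny budget are all hypothetical, and the counter uses the 1 token ≈ 4 characters approximation rather than the `EleutherAI/gpt-neox-20b` tokenizer.

```python
from typing import Callable, Dict, Iterable, Iterator


def approx_token_count(text: str) -> int:
    """Estimate tokens with the 1 token ~ 4 characters rule of thumb."""
    return len(text) // 4


def take_until_token_budget(
    rows: Iterable[Dict],
    budget: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> Iterator[Dict]:
    """Yield rows in order until the running token total reaches `budget`.

    `rows` can be any iterable of dicts with a "text" field, so a streamed
    dataset is consumed lazily and never fully downloaded.
    """
    total = 0
    for row in rows:
        if total >= budget:
            break
        total += count_tokens(row["text"])
        yield row


# Made-up in-memory rows standing in for a streamed `c4` partition.
sample_rows = [
    {"text": "a" * 40, "meta": {"source": "c4"}, "subset": "c4"},
    {"text": "b" * 40, "meta": {"source": "c4"}, "subset": "c4"},
    {"text": "c" * 40, "meta": {"source": "c4"}, "subset": "c4"},
]

# Each sample row counts as 10 approximate tokens; stop once 15 is reached.
selected = list(take_until_token_budget(sample_rows, budget=15))
```

Because the budget is checked before each row is taken, the final total can overshoot the budget by up to one row, which is consistent with the "~2 Billion" wording above. Passing a real tokenizer-based counter as `count_tokens` would replace the character approximation without changing the selection logic.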
## Citation

If you use this dataset, please cite the original RedPajama paper:

```bibtex
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```

Provider:
kanahia1