SpectraSuite/SlimPajama_300B

Name: SpectraSuite/SlimPajama_300B
Creator: SpectraSuite
Published: 2024-07-19 06:48:17
License: No description available

Hugging Face | Updated: 2024-07-19 | Indexed: 2025-04-12

Download link:

https://hf-mirror.com/datasets/SpectraSuite/SlimPajama_300B

Description:

---
language:
- en
pretty_name: SlimPajama_300B
---

SlimPajama_300B is a 300B-token sample of the de-duplicated SlimPajama dataset, tokenized with the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

Due to file size constraints, C4 and CommonCrawl have been uploaded in multiple chunks. Use the following commands to merge each of them back into a single file:

```bash
cat C4_part_* > C4.bin
cat CommonCrawl_part_* > CommonCrawl.bin
```

#### Data Distribution

| Data source   | Composition |
| ------------- | ----------- |
| CommonCrawl   | 0.5208      |
| C4            | 0.2668      |
| GitHub        | 0.0522      |
| Books         | 0.0420      |
| ArXiv         | 0.0442      |
| Wikipedia     | 0.0399      |
| StackExchange | 0.0337      |
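The `cat` merge commands in the card rely on the shell expanding `C4_part_*` in the correct order. A plain glob sorts lexicographically, so if the chunk suffixes happen to be unpadded numbers (an assumption; the card does not state the naming scheme), `part_10` would be concatenated before `part_2`. The following is a minimal sketch of a more defensive merge. The function name `merge_parts` is hypothetical, and `sort -V` (version sort) is assumed to be available, as it is in GNU coreutils.

```bash
#!/usr/bin/env bash
set -euo pipefail

# merge_parts PREFIX OUTPUT
# Concatenate every file matching PREFIX* into OUTPUT in natural order.
merge_parts() {
  local prefix="$1" output="$2"

  # sort -V orders "part_2" before "part_10", unlike a plain glob.
  printf '%s\n' "${prefix}"* | sort -V | while IFS= read -r part; do
    cat "$part"
  done > "$output"

  # Sanity check: the merged size must equal the summed size of the parts.
  local expected actual
  expected=$(( $(cat "${prefix}"* | wc -c) ))
  actual=$(( $(wc -c < "$output") ))
  [ "$expected" -eq "$actual" ]
}
```

With the chunks in the current directory, this would be called as `merge_parts C4_part_ C4.bin` and `merge_parts CommonCrawl_part_ CommonCrawl.bin`. The size check only catches truncated output, not misordered chunks, so it complements the ordering step rather than replacing it.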

Provider:

SpectraSuite
