SpectraSuite/SlimPajama_300B
Source: Hugging Face. Updated: 2024-07-19. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/SpectraSuite/SlimPajama_300B
Description:
---
language:
- en
pretty_name: SlimPajama_300B
---
SlimPajama_300B is a 300B-token sample of the de-duplicated SlimPajama dataset, tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.

Due to file size constraints, C4 and CommonCrawl have been uploaded in multiple chunks. You can use the following commands to merge each of them back into a single file:
```bash
cat C4_part_* > C4.bin
cat CommonCrawl_part_* > CommonCrawl.bin
```
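
Since the merged files are large, it may be worth confirming that nothing was lost during concatenation. The following is a minimal sketch, not part of the original instructions: `merge_parts` is a hypothetical helper name, and it assumes the chunk suffixes sort lexicographically in the intended order (the same order the shell glob hands them to `cat`).

```bash
# Hypothetical helper: merge <prefix>_part_* into <prefix>.bin,
# then check that the merged size equals the sum of the chunk sizes.
merge_parts() {
  local prefix="$1"
  local out="${prefix}.bin"
  local expected=0 actual size part

  # The glob expands in lexicographic order, which is the order cat uses.
  cat "${prefix}"_part_* > "$out" || return 1

  # Sum the byte sizes of the individual chunks.
  for part in "${prefix}"_part_*; do
    size=$(wc -c < "$part")
    expected=$((expected + size))
  done

  actual=$(wc -c < "$out")
  if [ "$((actual))" -ne "$expected" ]; then
    echo "size mismatch for $out: expected $expected bytes, got $((actual))" >&2
    return 1
  fi
  echo "$out: $expected bytes"
}
```

Called as `merge_parts C4` or `merge_parts CommonCrawl` in the download directory, it produces the same `.bin` files as the commands above and returns a non-zero status if the byte counts disagree.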
#### Data Distribution
| Data source | Composition |
| ------------- | ------------------------------- |
| CommonCrawl | 0.5208 |
| C4 | 0.2668 |
| GitHub | 0.0522 |
| Books | 0.0420 |
| ArXiv | 0.0442 |
| Wikipedia | 0.0399 |
| StackExchange | 0.0337 |
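
To turn these shares into rough token counts, each value can be multiplied by the 300B total. The sketch below assumes the composition column is a fraction of tokens (the card does not say so explicitly); `estimate_tokens` is a hypothetical helper name. The listed shares sum to 0.9996, so the estimates add up to slightly less than 300B.

```bash
# Hypothetical helper: read "<source> <share>" pairs on stdin and print
# the estimated token count in billions, as share x 300.
estimate_tokens() {
  awk -v total=300 '{ printf "%-14s %7.2fB\n", $1, $2 * total }'
}

# Shares copied from the table above.
estimate_tokens <<'EOF'
CommonCrawl 0.5208
C4 0.2668
GitHub 0.0522
Books 0.0420
ArXiv 0.0442
Wikipedia 0.0399
StackExchange 0.0337
EOF
```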
Provider:
SpectraSuite



