
streichjc/SlimPajama-6B

Source: Hugging Face. Updated 2025-12-07; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/streichjc/SlimPajama-6B
Description:
---
language:
- en
size_categories:
- 1M<n<10M
task_categories:
- text-generation
pretty_name: SlimPajama-6B
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: redpajama_set_name
      dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 23918118724
    num_examples: 5489000
  - name: validation
    num_bytes: 39109042
    num_examples: 9347
  - name: test
    num_bytes: 40114950
    num_examples: 9346
  download_size: 14048972121
  dataset_size: 23997342716
---

Sampled version of [cerebras/SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

[Since the original data was shuffled before chunking](https://huggingface.co/datasets/cerebras/SlimPajama-627B/discussions/4), I downloaded only train/chunk1 (of 10 total) and then sampled 10% of it. This should result in roughly 6B tokens, hence the name SlimPajama-6B.

The dataset takes 24 GB of storage when decompressed (the original dataset is over 2 TB) and has 5489000 rows.

The validation and test sets were sampled as well.

---

#### Data source proportions for SlimPajama-627B and SlimPajama-6B

As a sanity check, I calculated the byte proportion of each data source in the sampled version.

| Data source   | SlimPajama-627B | SlimPajama-6B |
| ------------- | --------------- | ------------- |
| Commoncrawl   | 52.2%           | 54.1%         |
| C4            | 26.7%           | 28.7%         |
| GitHub        | 5.2%            | 4.2%          |
| Books         | 4.2%            | 3.7%          |
| ArXiv         | 4.6%            | 3.4%          |
| Wikipedia     | 3.8%            | 3.1%          |
| StackExchange | 3.3%            | 2.8%          |

---

Please refer to the original dataset for other info.

```
@misc{cerebras2023slimpajama,
  author = {Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan},
  title = {{SlimPajama: A 627B token cleaned and deduplicated version of RedPajama}},
  month = June,
  year = 2023,
  howpublished = {\url{https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama}},
  url = {https://huggingface.co/datasets/cerebras/SlimPajama-627B},
}
```
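
The figures in the card can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original card: it estimates the token count implied by the sampling procedure and verifies that the per-split byte counts add up to the reported `dataset_size`. It assumes that chunk1 holds exactly one tenth of the training tokens and that the source has a nominal 627B tokens; the helper names (`estimated_tokens`, `total_split_bytes`, `mean_bytes_per_example`) are hypothetical and chosen only for illustration.

```python
# Split statistics copied from the dataset card's metadata.
SPLITS = {
    "train": {"num_bytes": 23_918_118_724, "num_examples": 5_489_000},
    "validation": {"num_bytes": 39_109_042, "num_examples": 9_347},
    "test": {"num_bytes": 40_114_950, "num_examples": 9_346},
}
DATASET_SIZE = 23_997_342_716  # dataset_size reported in the card

SOURCE_TOKENS = 627e9    # nominal size of SlimPajama-627B
CHUNK_FRACTION = 1 / 10  # assumption: chunk1 is exactly 1 of 10 equal chunks
SAMPLE_FRACTION = 0.10   # the further 10% sample described in the card


def estimated_tokens() -> float:
    """Rough token count implied by the sampling procedure."""
    return SOURCE_TOKENS * CHUNK_FRACTION * SAMPLE_FRACTION


def total_split_bytes() -> int:
    """Sum of num_bytes over the train, validation, and test splits."""
    return sum(split["num_bytes"] for split in SPLITS.values())


def mean_bytes_per_example(split_name: str) -> float:
    """Average uncompressed size of one row in the given split."""
    split = SPLITS[split_name]
    return split["num_bytes"] / split["num_examples"]


if __name__ == "__main__":
    print(f"estimated tokens: {estimated_tokens() / 1e9:.2f}B")
    print(f"split bytes match dataset_size: {total_split_bytes() == DATASET_SIZE}")
    print(f"mean bytes per train row: {mean_bytes_per_example('train'):.0f}")
```

Under these assumptions the estimate comes to about 6.27B tokens, consistent with the "roughly 6B" stated in the card, and the three split sizes sum exactly to the reported `dataset_size`.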
Provider:
streichjc