
krisbailey/cosmopedia-10B

Source: Hugging Face. Updated 2026-01-22. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/krisbailey/cosmopedia-10B
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- cosmopedia
- synthetic
- 10B
- parquet
- large-text-corpus
- general-text
- web-crawl
- cleaned-text
- pretraining-data
- unsupervised-learning
- nlp
- open-dataset
- language-model-training
size_categories:
- 10B<n<100B
---

# Cosmopedia 10B

## Dataset Description

This is a **10.53 billion token** subset of the [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset. It was created by sampling approximately **45%** of each subset (web_samples, stories, stanford, etc.) from the original dataset and then deduplicating the result.

## Motivation

The original Cosmopedia dataset is large (roughly 25B+ tokens) and high quality. This 10B version serves as a "Goldilocks" dataset: large enough for meaningful pre-training experiments, but small enough to iterate on quickly without massive compute resources.

## Dataset Details

- **Total Tokens:** 10,531,801,761 (~10.53B)
- **Source:** [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
- **Structure:** A probabilistic sample of every original subset, augmented to reach a strict 10B-token target.
- **Format:** Parquet (Snappy compression)
- **Producer:** Kris Bailey (kris@krisbailey.com)

## Usage

```python
from datasets import load_dataset

# Load the train split and print the first record
ds = load_dataset("krisbailey/cosmopedia-10B", split="train")
print(ds[0])
```

## Citation

Please cite the original Cosmopedia dataset:

```bibtex
@article{benallal2024cosmopedia,
  title={Cosmopedia: How to create large-scale synthetic data for pre-training},
  author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Chimdyalwar and Leandro von Werra and Thomas Wolf},
  year={2024},
  journal={arXiv preprint arXiv:2402.13753}
}
```

## Data Mixture

| Subset | Tokens | % of Total |
| :--- | :--- | :--- |
| `web_samples_v1` | 4,097,189,615 | 38.90% |
| `web_samples_v2` | 3,337,500,285 | 31.69% |
| `stories` | 1,188,075,064 | 11.28% |
| `auto_math_text` | 914,988,722 | 8.69% |
| `stanford` | 713,785,674 | 6.78% |
| `openstax` | 147,042,763 | 1.40% |
| `wikihow` | 120,689,663 | 1.15% |
| `khanacademy` | 12,529,975 | 0.12% |
| **Total** | **10,531,801,761** | **100.00%** |
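
The percentages in the Data Mixture table can be re-derived from the token counts alone. The snippet below is a minimal sketch of that arithmetic: the counts are copied from the table, while the names `SUBSET_TOKENS`, `REPORTED_TOTAL`, and `mixture_shares` are illustrative and not part of the dataset or any library.

```python
# Token counts copied from the Data Mixture table.
# All names here are illustrative; nothing is loaded from the dataset itself.
SUBSET_TOKENS = {
    "web_samples_v1": 4_097_189_615,
    "web_samples_v2": 3_337_500_285,
    "stories": 1_188_075_064,
    "auto_math_text": 914_988_722,
    "stanford": 713_785_674,
    "openstax": 147_042_763,
    "wikihow": 120_689_663,
    "khanacademy": 12_529_975,
}

# Total reported under "Dataset Details".
REPORTED_TOTAL = 10_531_801_761


def mixture_shares(counts):
    """Return each subset's share of the total, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total = sum(SUBSET_TOKENS.values())
shares = mixture_shares(SUBSET_TOKENS)

print(f"sum of subsets: {total:,} (matches reported total: {total == REPORTED_TOTAL})")
for name, pct in shares.items():
    print(f"{name:<16} {pct:6.2f}%")
```

The same shares can serve as sampling weights when building a smaller mix that keeps the subset proportions of this dataset.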
Provider:
krisbailey