krisbailey/cosmopedia-1b

Name: krisbailey/cosmopedia-1b
Creator: krisbailey
Published: 2026-01-22 20:03:17
License: odc-by

Source: Hugging Face. Updated: 2026-01-22. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/krisbailey/cosmopedia-1b

Description:

---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- cosmopedia
- synthetic
- 1B
- parquet
- large-text-corpus
- general-text
- web-crawl
- cleaned-text
- pretraining-data
- unsupervised-learning
- nlp
- open-dataset
- language-model-training
size_categories:
- 1B<n<10B
---

# Cosmopedia 1B

## Dataset Description

This is a **1 Billion token** subset of the [krisbailey/cosmopedia-10B](https://huggingface.co/datasets/krisbailey/cosmopedia-10B) dataset, which itself is a 10B subset of [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia).

It was created by uniformly sampling approximately **9.5%** of the 10B dataset, ensuring the data distribution remains consistent with the source.

## Motivation

While the 10B dataset is a "Goldilocks" size for many experiments, **1B tokens** is the standard size for rapid prototyping, scaling-law verification, and educational use. This dataset allows for training substantial models (e.g., TinyLlama size) in a matter of hours on consumer hardware.

## Dataset Details

- **Total Tokens:** 1,005,041,188 (~1.01B)
- **Source:** [krisbailey/cosmopedia-10B](https://huggingface.co/datasets/krisbailey/cosmopedia-10B)
- **Structure:** Uniform random sample of the 10B dataset.
- **Format:** Parquet (Snappy compression)
- **Producer:** Kris Bailey (kris@krisbailey.com)

## Usage

```python
from datasets import load_dataset

# Load the full training split and inspect the first record.
ds = load_dataset("krisbailey/cosmopedia-1b", split="train")
print(ds[0])
```

## Subsets & Slicing

Since this dataset was randomly shuffled during creation, you can safely slice it to get smaller, representative datasets (e.g., for scaling-law experiments) without needing to download the full dataset.

```python
from datasets import load_dataset

# 100M Token Subset (approx 10%)
ds_100m = load_dataset("krisbailey/cosmopedia-1b", split="train[:10%]")

# 500M Token Subset (approx 50%)
ds_500m = load_dataset("krisbailey/cosmopedia-1b", split="train[:50%]")
```

## Citation

Please cite the original Cosmopedia dataset:

```bibtex
@article{benallal2024cosmopedia,
  title={Cosmopedia: How to create large-scale synthetic data for pre-training},
  author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Chimdyalwar and Leandro von Werra and Thomas Wolf},
  year={2024},
  journal={arXiv preprint arXiv:2402.13753}
}
```

## Data Mixture

| Subset | Tokens | % of Total |
| :--- | :--- | :--- |
| `web_samples_v1` | 388,873,981 | 38.69% |
| `web_samples_v2` | 320,204,851 | 31.86% |
| `stories` | 111,953,618 | 11.14% |
| `auto_math_text` | 85,656,677 | 8.52% |
| `stanford` | 70,987,312 | 7.06% |
| `wikihow` | 16,019,867 | 1.59% |
| `openstax` | 9,294,289 | 0.92% |
| `khanacademy` | 2,050,593 | 0.20% |
| **Total** | **1,005,041,188** | **100.00%** |
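The percentages in the Data Mixture table follow directly from the token counts, and the same total gives a rough token budget for a percentage slice. The sketch below is illustrative only: the helper names `mixture_shares` and `approx_slice_tokens` are hypothetical, and the slice estimate assumes that documents are of roughly uniform length, since `train[:N%]` slices by rows rather than by tokens.

```python
# Token counts copied from the Data Mixture table.
SUBSET_TOKENS = {
    "web_samples_v1": 388_873_981,
    "web_samples_v2": 320_204_851,
    "stories": 111_953_618,
    "auto_math_text": 85_656_677,
    "stanford": 70_987_312,
    "wikihow": 16_019_867,
    "openstax": 9_294_289,
    "khanacademy": 2_050_593,
}
REPORTED_TOTAL = 1_005_041_188  # "Total Tokens" from Dataset Details


def mixture_shares(tokens):
    """Return each subset's share of the total as a percentage (2 decimals)."""
    total = sum(tokens.values())
    return {name: round(100 * count / total, 2) for name, count in tokens.items()}


def approx_slice_tokens(percent, total=REPORTED_TOTAL):
    """Rough token count for a `train[:percent%]` slice.

    Assumption: rows are shuffled and similar in length, so the token
    count scales linearly with the row percentage.
    """
    return total * percent // 100


if __name__ == "__main__":
    # The per-subset counts should add up to the reported total.
    print("sum matches total:", sum(SUBSET_TOKENS.values()) == REPORTED_TOTAL)
    for name, share in mixture_shares(SUBSET_TOKENS).items():
        print(f"{name:>16}: {share:.2f}%")
    for pct in (10, 50):
        print(f"train[:{pct}%] ~ {approx_slice_tokens(pct):,} tokens")
```

Because each share is rounded independently, the rounded percentages may not add up to exactly 100.00.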

Provider:

krisbailey
