
krisbailey/RedPajama-10B-Weighted

Hugging Face · Updated 2026-01-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/krisbailey/RedPajama-10B-Weighted
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- redpajama
- llm
- dataset-reproduction
- redpajama-10b
- redpajama-subset
- redpajama-weighted
- redpajama-sample
- natural-language-processing
size_categories:
- 1B<n<10B
pretty_name: RedPajama 10B Weighted Subset
---

# RedPajama-10B-Weighted

A **canonical 10-billion-token weighted subset** of the [RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset.

## Dataset Description

This dataset is a faithful reproduction of the original RedPajama-Data-1T distribution, scaled down to **10 billion tokens**. It is designed to preserve the **exact domain ratios** of the original dataset (excluding the defunct 'Books' subset).

This allows researchers and developers to prototype, debug, and test on a representative slice of the data without needing to download or process the full 1-trillion-token dataset. It serves as both a standalone dataset for medium-scale experiments and the parent source for smaller slices (like the [1B subset](https://huggingface.co/datasets/krisbailey/RedPajama-1B-Weighted)).

## Dataset Details

- **Total Tokens:** ~10,000,000,000 (10 billion)
- **Source:** [togethercomputer/RedPajama-Data-1T](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- **Language:** English
- **Format:** Apache Parquet
- **Producer:** Kris Bailey

## Motivation

The original RedPajama dataset is a standard for open-source LLM training, but its size (1TB+) makes it unwieldy for quick iteration, debugging, or educational purposes. Naive random sampling can distort the balance of data sources (CommonCrawl vs. C4 vs. GitHub).

This **RedPajama 10B subset** addresses this with a **weighted interleaving strategy** that adheres to the original mixing ratios, so that even at a smaller scale the data seen by the model follows the same domain distribution as the full run.

## Dataset Creation Process

The creation process involved a precise streaming and interleaving pipeline:

### 1. Source Streaming

We streamed the data directly from `togethercomputer/RedPajama-Data-1T` to avoid local storage bottlenecks.

### 2. Weighted Interleaving

We defined the target probabilities based on the original token counts:

- **CommonCrawl:** 74.16%
- **C4:** 14.78%
- **GitHub:** 4.98%
- **ArXiv:** 2.36%
- **Wikipedia:** 2.03%
- **StackExchange:** 1.69%

An interleaving algorithm sampled from these streams according to these probabilities to construct a single, unified stream.

### 3. Buffer Shuffling

To avoid burstiness (e.g., seeing 1000 Wikipedia articles in a row), we implemented a **buffer shuffle** with a size of 10,000 documents. This ensures a healthy mixture of domains throughout the dataset.

### 4. Verification

The process ran until exactly 10 Billion tokens were collected. We verified that the final composition matches the target weights.

## Composition

| Subset | Weight | Approx. Tokens |
| :--- | :--- | :--- |
| **CommonCrawl** | 74.16% | ~7.42 B |
| **C4** | 14.78% | ~1.48 B |
| **GitHub** | 4.98% | ~0.50 B |
| **ArXiv** | 2.36% | ~0.24 B |
| **Wikipedia** | 2.03% | ~0.20 B |
| **StackExchange** | 1.69% | ~0.17 B |

## Usage

```python
from datasets import load_dataset

# Load the 10B weighted subset
ds = load_dataset("krisbailey/RedPajama-10B-Weighted", split="train")
print(ds)
```

## Citation

If you use this dataset, please cite the original RedPajama work:

```bibtex
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
```
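
## Illustrative Sketch of the Pipeline

The weighted interleaving, buffer shuffling, and verification steps described under "Dataset Creation Process" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the function names (`weighted_interleave`, `buffer_shuffle`, `toy_stream`, `build_sample`, `composition`) are hypothetical, and the toy streams emit one-token documents so that document share equals token share. With real documents of varying length, per-document sampling probabilities would likely need adjusting to hit token-level ratios.

```python
import random
from collections import Counter
from itertools import count

# Target mixing weights from the README (percent of tokens per subset).
WEIGHTS = {
    "CommonCrawl": 74.16,
    "C4": 14.78,
    "GitHub": 4.98,
    "ArXiv": 2.36,
    "Wikipedia": 2.03,
    "StackExchange": 1.69,
}


def weighted_interleave(streams, weights, rng):
    """Yield items by repeatedly picking a stream with probability proportional to its weight."""
    names = list(streams)
    probs = [weights[name] for name in names]
    iterators = {name: iter(streams[name]) for name in names}
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield next(iterators[name])
        except StopIteration:
            return  # Simplification: stop as soon as any stream runs dry.


def buffer_shuffle(items, buffer_size, rng):
    """Approximate shuffle: keep `buffer_size` items and emit a random one per new arrival."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)  # Flush whatever is left once the input ends.
    yield from buffer


def toy_stream(name):
    """Hypothetical stand-in for a real subset: endless one-token documents."""
    return ({"source": name, "n_tokens": 1} for _ in count())


def build_sample(token_budget, buffer_size=10_000, seed=0):
    """Interleave, shuffle, and collect documents until the token budget is reached."""
    rng = random.Random(seed)
    streams = {name: toy_stream(name) for name in WEIGHTS}
    mixed = buffer_shuffle(weighted_interleave(streams, WEIGHTS, rng), buffer_size, rng)
    sample, total = [], 0
    for doc in mixed:
        if total >= token_budget:
            break
        sample.append(doc)
        total += doc["n_tokens"]
    return sample


def composition(sample):
    """Return each source's share of the total token count."""
    tokens = Counter()
    for doc in sample:
        tokens[doc["source"]] += doc["n_tokens"]
    total = sum(tokens.values())
    return {name: n / total for name, n in tokens.items()}
```

Calling `composition(build_sample(50_000, buffer_size=1_000))` returns per-source shares that can be compared against the Composition table. The buffer shuffle only mixes locally (within roughly one buffer's worth of documents), which is enough to break up bursts without holding the whole stream in memory.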
Provided by:
krisbailey