krisbailey/falcon-refinedweb-1B

Name: krisbailey/falcon-refinedweb-1B
Creator: krisbailey
Published: 2026-01-22 20:06:35
License: apache-2.0

Hugging Face | Updated 2026-01-22 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/krisbailey/falcon-refinedweb-1B


Resource description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- falcon
- refinedweb
- 1B
- parquet
- web-refined
- text-generation
- clean-web-corpus
- llm-pretrain
- domain-agnostic
- sentence-quality-filtered
- huggingface-refinedweb
size_categories:
- 1B<n<10B
---

# Falcon RefinedWeb 1B

## Dataset Description

This is a **1.01 billion token** subset of the [tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) dataset. It was created by streaming the source dataset through a large shuffle buffer to obtain a random, representative sample of the web data.

## Motivation

RefinedWeb is a high-quality filtered web dataset, but the full version is massive. This 1B-token slice is a convenient testbed for evaluating model architecture changes or for curriculum learning experiments.

## Dataset Details

- **Total Tokens:** 1,005,000,041 (~1.01B)
- **Source:** [tiiuae/falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- **Method:** Streamed with shuffle buffer (size=5000).
- **Format:** Parquet (Snappy compression)
- **Producer:** Kris Bailey (kris@krisbailey.com)

## Usage

```python
from datasets import load_dataset

# Load the full 1B-token subset and inspect the first record.
ds = load_dataset("krisbailey/falcon-refinedweb-1B", split="train")
print(ds[0])
```

## Subsets & Slicing

Since this dataset was randomly shuffled during creation, you can safely slice it to get smaller, representative datasets (e.g., for scaling-law experiments) without needing to download the full dataset.

```python
from datasets import load_dataset

# 100M Token Subset (approx 10%)
ds_100m = load_dataset("krisbailey/falcon-refinedweb-1B", split="train[:10%]")

# 500M Token Subset (approx 50%)
ds_500m = load_dataset("krisbailey/falcon-refinedweb-1B", split="train[:50%]")
```

## Citation

```bibtex
@article{penedo2023refinedweb,
  title={The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only},
  author={Penedo, Guilherme and Malartic, Quentin and Hesslow, Daniel and Cojocaru, Ruxandra and Cappelli, Alessandro and Alobeidli, Hamza and Pannier, Baptiste and Almazrouei, Ebtesam and Launay, Julien},
  journal={arXiv preprint arXiv:2306.01116},
  year={2023}
}
```
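The method listed under Dataset Details (streaming with a shuffle buffer of size 5000 until roughly 1B tokens are collected) can be illustrated with a small sketch. This is not the producer's actual script, which the card does not include. The helper names `shuffle_buffer`, `take_token_budget`, and `stream_refinedweb_sample` are hypothetical, the whitespace word count is only a stand-in because the card does not say which tokenizer produced the token total, and the `content` column name is an assumption about the source dataset's schema.

```python
import random
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def shuffle_buffer(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot for the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


def take_token_budget(
    docs: Iterable[T], count_tokens: Callable[[T], int], budget: int
) -> Iterator[T]:
    """Yield documents until their cumulative token count reaches the budget."""
    if budget <= 0:
        return
    total = 0
    for doc in docs:
        yield doc
        total += count_tokens(doc)
        if total >= budget:
            return


def stream_refinedweb_sample(budget: int, buffer_size: int = 5000) -> Iterator[str]:
    """Hypothetical end-to-end use against the source dataset (requires network)."""
    # Imported lazily so the helpers above work without the `datasets` package.
    from datasets import load_dataset

    stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
    # Assumption: the document text lives in a column named "content".
    texts = (row["content"] for row in stream)
    # Assumption: whitespace word count as a rough proxy for tokens.
    return take_token_budget(
        shuffle_buffer(texts, buffer_size), lambda text: len(text.split()), budget
    )
```

A shuffle buffer only mixes each document with its neighbours within a window of `buffer_size` items, so the result is an approximate shuffle of the stream rather than a uniform one. Streaming datasets in the `datasets` library also provide a built-in `shuffle(seed=..., buffer_size=...)` method that works on the same principle and could replace the hand-written helper.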

Provider:

krisbailey
