JonathanMiddleton/fineweb-edu-dedup-shuffled

Name: JonathanMiddleton/fineweb-edu-dedup-shuffled
Creator: JonathanMiddleton
Published: 2026-03-06 17:59:36
License: ODC-BY 1.0 (per the dataset card below)

Source: Hugging Face. Updated: 2026-03-06. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled


Description:

---
language:
- en
license: odc-by
size_categories:
- 100M<n<1B
task_categories:
- text-generation
tags:
- pretraining
- shuffled
- fineweb
- education
dataset_info:
  features:
  - name: text
    dtype: large_string
  - name: _source_index
    dtype: int64
  splits:
  - name: train
    num_examples: 190168005
pretty_name: FineWeb-Edu-Dedup (Globally Shuffled)
---

# FineWeb-Edu-Dedup (Globally Shuffled)

A uniformly shuffled version of the FineWeb-Edu-Dedup subset from SmolLM-Corpus by HuggingFace.

## Source Data

This dataset is derived from [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), specifically the `fineweb-edu-dedup` subset. That subset is itself derived from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), a filtered and deduplicated extract of [Common Crawl](https://commoncrawl.org/) selected for educational content quality.

| Property | Value |
|---|---|
| Source dataset | [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) |
| Source subset | `fineweb-edu-dedup` |
| Upstream origin | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) via Common Crawl |
| Source files | 234 parquet files |
| Total rows | 190,168,005 |
| Output files | 381 parquet files (~499K rows each) |
| Shuffle seed | `42` |
| Compression | zstd |

The text content is **byte-identical** to the source. No filtering, deduplication, or transformation has been applied beyond reordering rows. Each output row includes a `_source_index` column recording the row's original position in the source dataset for full traceability.

## Motivation

The upstream FineWeb-Edu-Dedup parquet files are organized by Common Crawl dump, producing temporal and topical clustering: consecutive rows tend to come from the same crawl, the same domains, and similar subject matter.
When pretokenized training shards are built by reading these files sequentially, this clustering propagates into the training data, reducing gradient diversity during pretraining. This dataset eliminates that ordering bias by applying a provably uniform global shuffle to all 190 million rows.

## Schema

| Column | Type | Description |
|---|---|---|
| `text` | `large_string` | Document text, byte-identical to the source |
| `_source_index` | `int64` | Original row index in the source dataset (0-indexed across all 234 source files concatenated in sorted filename order) |

## Methodology

### Uniform Permutation

A single permutation of all N = 190,168,005 row indices is generated using the Fisher-Yates shuffle (also known as the Knuth shuffle). Fisher-Yates is the standard algorithm for generating uniformly random permutations: given an unbiased source of random numbers, it produces each of the N! possible orderings with exactly equal probability 1/N!.

The permutation assigns every source row a unique output position. From this, each row's destination output file and position within that file are derived deterministically.

### Pseudorandom Number Generator

The permutation is generated using NumPy's PCG64 (Permuted Congruential Generator) with a 128-bit state and period of 2^128. To prevent correlation between runs with sequential seeds, the integer seed is hashed through BLAKE2b before being used to initialize the generator. The output is fully deterministic: the same seed always produces the same permutation.

### Two-Pass Shuffle

The shuffle is executed in two sequential-I/O passes to avoid random access across the full dataset:

**Pass 1 (Scatter)**: The 234 source parquet files are read sequentially. For each row, the precomputed permutation determines which output bucket it belongs to. Rows are buffered by bucket and flushed to intermediate shard files on disk when buffers fill. All I/O is sequential. Multiple workers process source files in parallel, each writing to its own shard files.
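
The seeded permutation and the scatter step can be illustrated with a small, standard-library-only sketch. It is not the author's implementation: `random.Random` (Mersenne Twister) stands in for NumPy's PCG64, the exact BLAKE2b seeding scheme is an assumption, in-memory lists stand in for on-disk shard files, and the function names `seeded_rng`, `fisher_yates`, and `scatter` are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def seeded_rng(seed: int) -> random.Random:
    """Hash an integer seed with BLAKE2b before seeding the generator.

    Assumption: the card does not give the exact hashing scheme; this is
    one plausible variant. random.Random replaces PCG64 for portability.
    """
    digest = hashlib.blake2b(str(seed).encode("utf-8"), digest_size=16).digest()
    return random.Random(int.from_bytes(digest, "big"))


def fisher_yates(n: int, rng: random.Random) -> list[int]:
    """Return a permutation of range(n) built with the Fisher-Yates shuffle."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def scatter(rows, perm, bucket_size):
    """Pass 1: route each source row to its bucket using perm[source_index]."""
    buckets = defaultdict(list)
    for source_index, text in enumerate(rows):
        dest = perm[source_index]  # global output position of this row
        bucket, position = divmod(dest, bucket_size)
        buckets[bucket].append((position, source_index, text))
    return buckets


rows = [f"doc-{i}" for i in range(10)]
perm = fisher_yates(len(rows), seeded_rng(42))
buckets = scatter(rows, perm, bucket_size=4)
```

Each bucket ends up holding `(position, source_index, text)` tuples in arrival order; sorting a bucket by `position` afterwards yields the permutation-defined order for that output file.
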

**Pass 2 (Gather)**: For each of the 381 output buckets, all shard files are read, concatenated, sorted into the permutation-defined order, and written as the final output parquet. Each bucket is independent, making this embarrassingly parallel.

This approach requires no random access across the full dataset and uses bounded memory per worker regardless of dataset size.

### Statistical Verification

The sampling logic is isolated in a pure module with no I/O or side effects and is covered by the following statistical tests:

- **Positional uniformity**: Chi-squared tests are consistent with each element being equally likely at each output position (n=12, 600K trials, alpha=0.001).
- **Adjacency uniformity**: Chi-squared tests are consistent with each element being equally likely to follow any other element (n=12, 600K trials, alpha=0.001).
- **Full permutation uniformity**: For n=6, the observed frequencies of all 720 possible permutations over 3M trials are consistent with a uniform distribution (chi-squared, alpha=0.001).
- **Seed independence**: Spearman rank correlations between permutations from 10K consecutive seed pairs are verified to be near zero.

### Bucket Sizing

The 190M rows are distributed across 381 output files (~499K rows each). Bucket size controls the statistical representativeness of each output file. For a bucket of *m* rows, a category with global frequency *p* has a relative standard error of approximately `1/sqrt(m*p)` in its within-bucket representation. At ~499K rows per bucket:

| Category frequency *p* | Expected count per bucket | Relative error |
|---|---|---|
| 10% | ~49,900 | 0.45% |
| 1% | ~4,990 | 1.4% |
| 0.1% | ~499 | 4.5% |
| 0.04% | ~200 | 7.1% |

Categories as rare as 0.04% of the dataset have a relative error of about 7%, well under 10%, in any single bucket. This means each output file is approximately representative of the global distribution; that is, the file-level ordering is approximately exchangeable.

### Output Verification

After the shuffle completes, automated checks confirm:

1. **Row count**: Total rows across all 381 output files equals 190,168,005.
2. **Permutation validity**: All `_source_index` values form a valid permutation of [0, 190168005) with no duplicates or gaps.

## Known Limitations

**HuggingFace Data Studio**: The parquet files were written without a page index, which prevents the HuggingFace Data Studio from serving random row previews without loading entire row groups. This does not affect programmatic consumption (PyArrow, pandas, DuckDB, etc.); only the web-based Data Studio preview is affected. A future re-serialization with `write_page_index=True` and smaller row-group sizes would resolve this.

## License

This dataset inherits the [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/) license from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) via [SmolLM-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).
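
## Appendix: Checking Permutation Validity

Readers who want to re-run the permutation-validity check on `_source_index` themselves could start from a sketch like the one below. It assumes the `_source_index` values have already been read from the output files into an iterable of integers; the function name `is_valid_permutation` is hypothetical and this is not the author's verification code.

```python
def is_valid_permutation(source_indices, n: int) -> bool:
    """Return True if the values are exactly 0..n-1, each appearing once."""
    seen = bytearray(n)  # one flag byte per expected index
    count = 0
    for idx in source_indices:
        if not 0 <= idx < n or seen[idx]:
            return False  # out-of-range value or duplicate
        seen[idx] = 1
        count += 1
    return count == n  # fewer than n values means there are gaps
```

For the full dataset, `n` would be 190168005, so the flag array needs roughly 190 MB of memory; the values can be streamed file by file rather than loaded all at once.
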

