JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized

Name: JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized
Creator: JonathanMiddleton
Published: 2026-03-06 20:06:00
License: ODC-BY 1.0

Source: Hugging Face (updated 2026-03-06; indexed 2026-03-29)

Download link:

https://hf-mirror.com/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized

Description:

---
language:
- en
license: odc-by
size_categories:
- 100B<n<1T
task_categories:
- text-generation
tags:
- pretraining
- pretokenized
- shuffled
- fineweb
- education
pretty_name: FineWeb-Edu-Dedup Shuffled Pretokenized
---

# FineWeb-Edu-Dedup Shuffled Pretokenized

Pretokenized training shards built from a globally shuffled version of FineWeb-Edu-Dedup. Ready for direct consumption by the Daisy pretraining loop.

## Summary

| Property | Value |
|---|---|
| Total tokens | 181,465,257,766 (~181.5B) |
| Train tokens | ~180.5B |
| Val tokens | 1,000,000,000 (1B) |
| Train shards | 1,994 |
| Val shards | 10 |
| Tokens per shard | 100,000,000 (full shards); last shard per worker may be partial |
| Documents (train) | 180,185,493 |
| Documents (val) | 992,927 |
| Tokenizer | [`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) (49,152 vocab, BPE) |
| Token dtype | uint16 |
| Shard format | v3 (magic=20260114, version=3) |
| EOS token ID | 49131 |

## Directory Structure

```
train/
  000000.bin
  000001.bin
  ...
  001993.bin
val/
  000000.bin
  000001.bin
  ...
  000009.bin
```

Each `.bin` file contains a 1024-byte header followed by a flat array of uint16 token IDs.

## Shard Format

Each shard file has a fixed 1024-byte header (256 int32 words) followed by the token payload:

| Header word | Field | Value |
|---|---|---|
| 0 | magic | 20260114 |
| 1 | version | 3 |
| 2 | num_tokens | number of tokens in this shard |
| 3 | tokenizer_crc | CRC32 of tokenizer name (stored as uint32 in int32 slot) |
| 4 | vocab_size | 49152 |
| 5 | eos_id | 49131 |
| 6 | dtype_bits | 16 |

The token stream is a concatenation of documents separated by EOS tokens:

```
[EOS] [doc1_token1] [doc1_token2] ... [EOS] [doc2_token1] ...
```

Every document begins with an EOS token (ID 49131). Documents may span shard boundaries: a document that doesn't fit entirely in one shard continues at the start of the next shard within the same worker's output.
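
The header layout described above can be parsed with a few lines of NumPy. What follows is a minimal sketch rather than the Daisy loader itself: the function name `read_shard` and the returned dictionary keys are hypothetical, and little-endian byte order is an assumption, since the card does not state the byte order.

```python
import numpy as np

HEADER_WORDS = 256                 # 256 int32 words
HEADER_BYTES = HEADER_WORDS * 4    # = 1024 bytes
MAGIC = 20260114
VERSION = 3


def read_shard(path):
    """Parse one shard and return (meta, tokens).

    Assumes little-endian storage (not stated in the card).
    """
    header = np.fromfile(path, dtype="<i4", count=HEADER_WORDS)
    if header.size != HEADER_WORDS:
        raise ValueError("file is shorter than the 1024-byte header")
    if int(header[0]) != MAGIC or int(header[1]) != VERSION:
        raise ValueError("unexpected magic or version")
    if int(header[6]) != 16:
        raise ValueError("expected 16-bit tokens")

    meta = {
        "num_tokens": int(header[2]),
        # The CRC is a uint32 stored in an int32 slot, so mask off the sign.
        "tokenizer_crc": int(header[3]) & 0xFFFFFFFF,
        "vocab_size": int(header[4]),
        "eos_id": int(header[5]),
    }

    # The payload starts immediately after the header.
    tokens = np.fromfile(
        path, dtype="<u2", count=meta["num_tokens"], offset=HEADER_BYTES
    )
    if tokens.size != meta["num_tokens"]:
        raise ValueError("payload is shorter than num_tokens")
    return meta, tokens
```

For full 100M-token shards (about 200 MB each), `np.memmap` with the same `offset` would avoid loading the whole payload into memory; `np.fromfile` is used here only to keep the sketch short.
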
The training data loader treats all shards as a single continuous token stream.

## Provenance

This dataset was produced by a two-stage pipeline:

### Stage 1: Global Shuffle (parquet)

The 190,168,005 rows of [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) (fineweb-edu-dedup subset) were globally shuffled using a Fisher-Yates permutation with BLAKE2b-seeded PCG64 PRNG (seed=42). The shuffled parquet is published separately at [`JonathanMiddleton/fineweb-edu-dedup-shuffled`](https://huggingface.co/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled).

The shuffle eliminates temporal and topical clustering from the upstream Common Crawl dump ordering, improving gradient diversity during pretraining.

### Stage 2: Pretokenization (this dataset)

The shuffled parquet was tokenized using the [`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) tokenizer (49,152 vocab BPE) and written as uint16 binary shards.

- **Train/val split**: The first 20 of 381 shuffled parquet files (about 5%) were reserved for validation. The remaining 361 files were used for training.
- **Train shards**: 190 parallel workers drained all 361 train parquet files, producing 1,994 shards (1,803 full shards of 100M tokens + 191 partial final shards).
- **Val shards**: 1 worker tokenized the 20 val parquet files, capped at 10 shards (1B tokens). Not all val documents were tokenized due to the shard cap.

### Verification

Post-build validation confirmed:

- All shard headers are valid (magic, version, tokenizer CRC, payload size).
- Sequential shard naming with no gaps.
- Train EOS token count (180,185,493) matches the source row count for the 361 train parquet files (180,185,425 rows). The +68 difference is within tolerance (0.00004%), likely from documents whose tokenized content incidentally contains the EOS token ID.

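
The figures in the verification list can be re-derived with simple arithmetic, and the EOS-based document count is easy to reproduce on any token array. The snippet below is an illustrative sketch only: `count_documents` is a hypothetical helper, not part of the build pipeline, and the constants are copied from the numbers quoted above.

```python
import numpy as np

EOS_ID = 49131


def count_documents(tokens, eos_id=EOS_ID):
    """Count EOS tokens, used here as a proxy for the number of documents."""
    return int(np.count_nonzero(np.asarray(tokens) == eos_id))


# Figures quoted in the card.
full_shards, partial_shards = 1_803, 191
train_eos_count = 180_185_493
train_source_rows = 180_185_425

total_shards = full_shards + partial_shards            # 1,994 train shards
eos_surplus = train_eos_count - train_source_rows      # +68
surplus_pct = 100 * eos_surplus / train_source_rows    # about 0.00004 percent

print(total_shards, eos_surplus, f"{surplus_pct:.5f}%")
```

Because each document is preceded by exactly one EOS token, a surplus in the EOS count over the source row count indicates extra occurrences of ID 49131 inside document bodies, which is the explanation the card offers for the +68 difference.
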
## Usage

### Download

```bash
python -m data.download_dataset fineweb-edu-shuffled
```

This downloads to `data/fineweb-edu-shuffled/train/` and `data/fineweb-edu-shuffled/val/`.

### Training Configuration

In a Daisy training YAML config:

```yaml
train_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/train"
    sequence_length: 65536
val_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/val"
    target_tokens: 1_000_000
    sequence_length: 65536
```

The data loader globs `*.bin` from the directory and reads shards sequentially.

### Shard Range Selection

To use a subset of shards (e.g., for multi-stage training that avoids data reuse):

```yaml
path: "data/fineweb-edu-shuffled/train[000500:001000]"
```

This selects shards 000500.bin through 001000.bin (inclusive), using the range filter supported by the Daisy data loader.

## License

This dataset inherits the [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/) license from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) via [SmolLM-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).

