
JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized

Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized
---
language:
- en
license: odc-by
size_categories:
- 100B<n<1T
task_categories:
- text-generation
tags:
- pretraining
- pretokenized
- shuffled
- fineweb
- education
pretty_name: FineWeb-Edu-Dedup Shuffled Pretokenized
---

# FineWeb-Edu-Dedup Shuffled Pretokenized

Pretokenized training shards built from a globally shuffled version of FineWeb-Edu-Dedup. Ready for direct consumption by the Daisy pretraining loop.

## Summary

| Property | Value |
|---|---|
| Total tokens | 181,465,257,766 (~181.5B) |
| Train tokens | ~180.5B |
| Val tokens | 1,000,000,000 (1B) |
| Train shards | 1,994 |
| Val shards | 10 |
| Tokens per shard | 100,000,000 (full shards); last shard per worker may be partial |
| Documents (train) | 180,185,493 |
| Documents (val) | 992,927 |
| Tokenizer | [`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) (49,152 vocab, BPE) |
| Token dtype | uint16 |
| Shard format | v3 (magic=20260114, version=3) |
| EOS token ID | 49131 |

## Directory Structure

```
train/
  000000.bin
  000001.bin
  ...
  001993.bin
val/
  000000.bin
  000001.bin
  ...
  000009.bin
```

Each `.bin` file contains a 1024-byte header followed by a flat array of uint16 token IDs.

## Shard Format

Each shard file has a fixed 1024-byte header (256 int32 words) followed by the token payload:

| Header word | Field | Value |
|---|---|---|
| 0 | magic | 20260114 |
| 1 | version | 3 |
| 2 | num_tokens | number of tokens in this shard |
| 3 | tokenizer_crc | CRC32 of tokenizer name (stored as uint32 in int32 slot) |
| 4 | vocab_size | 49152 |
| 5 | eos_id | 49131 |
| 6 | dtype_bits | 16 |

The token stream is a concatenation of documents separated by EOS tokens:

```
[EOS] [doc1_token1] [doc1_token2] ... [EOS] [doc2_token1] ...
```

Every document begins with an EOS token (ID 49131). Documents may span shard boundaries: a document that doesn't fit entirely in one shard continues at the start of the next shard within the same worker's output.
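
As a minimal sketch of the header layout described above, the following Python reads one shard using only the standard library. It assumes little-endian byte order for both the header words and the token payload, which this card does not state. `parse_header` and `read_shard` are hypothetical helper names for illustration, not part of the Daisy data loader.

```python
import struct
import sys
from array import array

HEADER_WORDS = 256
HEADER_BYTES = HEADER_WORDS * 4  # 1024 bytes
MAGIC = 20260114
VERSION = 3


def parse_header(raw: bytes) -> dict:
    """Decode the 1024-byte header into the fields from the header table."""
    if len(raw) != HEADER_BYTES:
        raise ValueError(f"expected {HEADER_BYTES} header bytes, got {len(raw)}")
    # Assumption: header words are little-endian int32.
    words = struct.unpack("<256i", raw)
    header = {
        "magic": words[0],
        "version": words[1],
        "num_tokens": words[2],
        # The CRC is a uint32 stored in an int32 slot, so mask it back to unsigned.
        "tokenizer_crc": words[3] & 0xFFFFFFFF,
        "vocab_size": words[4],
        "eos_id": words[5],
        "dtype_bits": words[6],
    }
    if header["magic"] != MAGIC or header["version"] != VERSION:
        raise ValueError("not a v3 shard (magic/version mismatch)")
    if header["dtype_bits"] != 16:
        raise ValueError("expected 16-bit token IDs")
    return header


def read_shard(path: str):
    """Return (header, tokens) for one shard; tokens is an array of uint16 IDs."""
    with open(path, "rb") as f:
        header = parse_header(f.read(HEADER_BYTES))
        tokens = array("H")  # unsigned 16-bit items
        tokens.frombytes(f.read(header["num_tokens"] * 2))
    if sys.byteorder == "big":
        tokens.byteswap()  # payload assumed little-endian on disk
    if len(tokens) != header["num_tokens"]:
        raise ValueError("payload is shorter than num_tokens")
    return header, tokens
```

This loads the whole payload into memory, which is fine for inspecting a shard but not how a training loader would typically stream 100M-token files; a memory-mapped read would be the usual alternative.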
The training data loader treats all shards as a single continuous token stream.

## Provenance

This dataset was produced by a two-stage pipeline:

### Stage 1: Global Shuffle (parquet)

The 190,168,005 rows of [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) (fineweb-edu-dedup subset) were globally shuffled using a Fisher-Yates permutation with BLAKE2b-seeded PCG64 PRNG (seed=42). The shuffled parquet is published separately at [`JonathanMiddleton/fineweb-edu-dedup-shuffled`](https://huggingface.co/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled).

The shuffle eliminates temporal and topical clustering from the upstream Common Crawl dump ordering, improving gradient diversity during pretraining.

### Stage 2: Pretokenization (this dataset)

The shuffled parquet was tokenized using the [`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) tokenizer (49,152 vocab BPE) and written as uint16 binary shards.

- **Train/val split**: The first 20 of 381 shuffled parquet files (about 5%) were reserved for validation. The remaining 361 files were used for training.
- **Train shards**: 190 parallel workers drained all 361 train parquet files, producing 1,994 shards (1,803 full shards of 100M tokens + 191 partial final shards).
- **Val shards**: 1 worker tokenized the 20 val parquet files, capped at 10 shards (1B tokens). Not all val documents were tokenized due to the shard cap.

### Verification

Post-build validation confirmed:

- All shard headers are valid (magic, version, tokenizer CRC, payload size).
- Sequential shard naming with no gaps.
- Train EOS token count (180,185,493) matches the source row count for the 361 train parquet files (180,185,425 rows). The +68 difference is within tolerance (0.00004%), likely from documents whose tokenized content incidentally contains the EOS token ID.
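
The EOS-count check in the last bullet can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted above; `count_eos` and `eos_surplus` are illustrative names, not functions from the actual build pipeline. It relies on the stated convention that every document begins with exactly one EOS token, so the EOS count should equal the document count.

```python
# Figures quoted in the Verification section.
TRAIN_EOS_COUNT = 180_185_493
TRAIN_SOURCE_ROWS = 180_185_425
EOS_ID = 49131


def count_eos(tokens, eos_id: int = EOS_ID) -> int:
    """Count EOS tokens in an iterable of token IDs (one per document start)."""
    return sum(1 for t in tokens if t == eos_id)


def eos_surplus(eos_count: int, source_rows: int):
    """Return (absolute difference, difference as a percentage of source rows)."""
    diff = eos_count - source_rows
    return diff, 100.0 * diff / source_rows


if __name__ == "__main__":
    diff, pct = eos_surplus(TRAIN_EOS_COUNT, TRAIN_SOURCE_ROWS)
    print(f"surplus EOS tokens: {diff} ({pct:.5f}% of source rows)")
```

In a full check, `count_eos` would be summed over the payloads of all train shards and the total passed to `eos_surplus`.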
## Usage

### Download

```bash
python -m data.download_dataset fineweb-edu-shuffled
```

This downloads to `data/fineweb-edu-shuffled/train/` and `data/fineweb-edu-shuffled/val/`.

### Training Configuration

In a Daisy training YAML config:

```yaml
train_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/train"
    sequence_length: 65536
val_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/val"
    target_tokens: 1_000_000
    sequence_length: 65536
```

The data loader globs `*.bin` from the directory and reads shards sequentially.

### Shard Range Selection

To use a subset of shards (e.g., for multi-stage training that avoids data reuse):

```yaml
path: "data/fineweb-edu-shuffled/train[000500:001000]"
```

This selects shards 000500.bin through 001000.bin (inclusive), using the range filter supported by the Daisy data loader.

## License

This dataset inherits the [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/) license from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) via [SmolLM-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).

Provider: JonathanMiddleton