JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized
Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled-pretokenized
---
language:
- en
license: odc-by
size_categories:
- 100B<n<1T
task_categories:
- text-generation
tags:
- pretraining
- pretokenized
- shuffled
- fineweb
- education
pretty_name: FineWeb-Edu-Dedup Shuffled Pretokenized
---
# FineWeb-Edu-Dedup Shuffled Pretokenized
Pretokenized training shards built from a globally shuffled version of FineWeb-Edu-Dedup.
Ready for direct consumption by the Daisy pretraining loop.
## Summary
| Property | Value |
|---|---|
| Total tokens | 181,465,257,766 (~181.5B) |
| Train tokens | ~180.5B |
| Val tokens | 1,000,000,000 (1B) |
| Train shards | 1,994 |
| Val shards | 10 |
| Tokens per shard | 100,000,000 (full shards); last shard per worker may be partial |
| Documents (train) | 180,185,493 |
| Documents (val) | 992,927 |
| Tokenizer | [`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) (49,152 vocab, BPE) |
| Token dtype | uint16 |
| Shard format | v3 (magic=20260114, version=3) |
| EOS token ID | 49131 |
## Directory Structure
```
train/
  000000.bin
  000001.bin
  ...
  001993.bin
val/
  000000.bin
  000001.bin
  ...
  000009.bin
```
Each `.bin` file contains a 1024-byte header followed by a flat array of uint16 token IDs.
## Shard Format
Each shard file has a fixed 1024-byte header (256 int32 words) followed by the token payload:
| Header word | Field | Value |
|---|---|---|
| 0 | magic | 20260114 |
| 1 | version | 3 |
| 2 | num_tokens | number of tokens in this shard |
| 3 | tokenizer_crc | CRC32 of tokenizer name (stored as uint32 in int32 slot) |
| 4 | vocab_size | 49152 |
| 5 | eos_id | 49131 |
| 6 | dtype_bits | 16 |
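
The header layout in the table above can be decoded with the standard library alone. The following is a minimal sketch, not the Daisy loader's own code: the names `ShardHeader` and `parse_shard_header` are hypothetical, and little-endian byte order is an assumption, since the card does not state the byte order.

```python
import struct
from typing import NamedTuple

HEADER_BYTES = 1024   # 256 int32 words
MAGIC = 20260114
VERSION = 3


class ShardHeader(NamedTuple):
    magic: int
    version: int
    num_tokens: int
    tokenizer_crc: int
    vocab_size: int
    eos_id: int
    dtype_bits: int


def parse_shard_header(raw: bytes) -> ShardHeader:
    """Decode the first seven words of a 1024-byte shard header.

    Little-endian byte order is assumed; the dataset card does not state it.
    """
    if len(raw) < HEADER_BYTES:
        raise ValueError(f"header too short: {len(raw)} bytes")
    # Word 3 is read as unsigned because the CRC32 is stored as uint32.
    header = ShardHeader(*struct.unpack_from("<3iI3i", raw, 0))
    if header.magic != MAGIC or header.version != VERSION:
        raise ValueError(
            f"unexpected magic/version: {header.magic}/{header.version}"
        )
    return header
```

In practice the input would be the first 1024 bytes of a shard, for example `open("train/000000.bin", "rb").read(1024)`.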
The token stream is a concatenation of documents separated by EOS tokens:
```
[EOS] [doc1_token1] [doc1_token2] ... [EOS] [doc2_token1] ...
```
Every document begins with an EOS token (ID 49131), so EOS acts as a leading delimiter rather than a terminator. Documents may span shard boundaries:
a document that does not fit entirely in one shard continues at the start of the next shard
within the same worker's output, so a shard does not necessarily start with EOS. The training
data loader treats all shards as a single continuous token stream.
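
To recover individual documents from this layout, the shards can be walked in order and split at each EOS. The sketch below assumes each shard has already been loaded as a sequence of integer token IDs; `iter_documents` is a hypothetical helper, not part of Daisy. A document whose content happens to contain token ID 49131 would be split in two by this approach.

```python
from typing import Iterable, Iterator, List

EOS_ID = 49131


def iter_documents(
    shards: Iterable[Iterable[int]], eos_id: int = EOS_ID
) -> Iterator[List[int]]:
    """Yield each document's tokens (without its leading EOS), shards read in order."""
    current: List[int] = []
    started = False  # becomes True once the first EOS has been seen
    for shard in shards:
        for token in shard:
            if token == eos_id:
                if started:
                    yield current
                current = []
                started = True
            elif started:
                current.append(token)
            # Tokens before the first EOS belong to a document that began in
            # an earlier shard that was not supplied, so they are skipped.
    if started:
        yield current
```

Because the document buffer is carried across shards, a document that continues at the start of the next shard is reassembled rather than cut at the file boundary.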
## Provenance
This dataset was produced by a two-stage pipeline:
### Stage 1: Global Shuffle (parquet)
The 190,168,005 rows of [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)
(fineweb-edu-dedup subset) were globally shuffled using a Fisher-Yates permutation with
BLAKE2b-seeded PCG64 PRNG (seed=42). The shuffled parquet is published separately at
[`JonathanMiddleton/fineweb-edu-dedup-shuffled`](https://huggingface.co/datasets/JonathanMiddleton/fineweb-edu-dedup-shuffled).
The shuffle eliminates temporal and topical clustering from the upstream Common Crawl dump
ordering, improving gradient diversity during pretraining.
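
For illustration, a seeded Fisher-Yates permutation can be sketched as follows. This is not the pipeline's code: to stay dependency-free it uses Python's built-in Mersenne Twister (`random.Random`) instead of PCG64, and the way the BLAKE2b digest is derived from the seed is an assumption, so the resulting order will not match the published shuffle.

```python
import hashlib
import random
from typing import List


def seeded_permutation(n: int, seed: int = 42) -> List[int]:
    """Return a deterministic permutation of range(n) via Fisher-Yates."""
    # Hypothetical seed derivation: hash the integer seed with BLAKE2b and
    # use the digest as PRNG seed material.
    digest = hashlib.blake2b(str(seed).encode("utf-8"), digest_size=16).digest()
    # Mersenne Twister stands in for PCG64 here.
    rng = random.Random(int.from_bytes(digest, "little"))
    perm = list(range(n))
    # Walk from the end, swapping each slot with a uniformly chosen slot
    # at or before it.
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

Row `perm[k]` of the source would then be written as row `k` of the shuffled output; the same seed always reproduces the same order.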
### Stage 2: Pretokenization (this dataset)
The shuffled parquet was tokenized using the
[`jonathanmiddleton/daisy`](https://huggingface.co/jonathanmiddleton/daisy) tokenizer
(49,152 vocab BPE) and written as uint16 binary shards.
- **Train/val split**: The first 20 of 381 shuffled parquet files (about 5%) were reserved for
validation. The remaining 361 files were used for training.
- **Train shards**: 190 parallel workers drained all 361 train parquet files, producing
1,994 shards (1,803 full shards of 100M tokens + 191 partial final shards).
- **Val shards**: 1 worker tokenized the 20 val parquet files, capped at 10 shards (1B tokens).
Not all val documents were tokenized due to the shard cap.
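
The shard and token counts quoted in this card can be cross-checked with a few lines of arithmetic. The sketch below takes the published figures at face value and assumes every "full" shard holds exactly 100,000,000 tokens; the variable names are illustrative only.

```python
TOKENS_PER_FULL_SHARD = 100_000_000

total_tokens = 181_465_257_766
val_tokens = 10 * TOKENS_PER_FULL_SHARD       # 10 full val shards = 1B tokens
train_tokens = total_tokens - val_tokens      # roughly 180.5B

full_train_shards = 1_803
partial_train_shards = 191
total_train_shards = full_train_shards + partial_train_shards

# Tokens left over for the partial final shards, taken together.
partial_tokens = train_tokens - full_train_shards * TOKENS_PER_FULL_SHARD

# Share of parquet files reserved for validation, as a percentage.
val_file_percent = 100 * 20 / 381

print(total_train_shards, train_tokens, partial_tokens, round(val_file_percent, 2))
```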
### Verification
Post-build validation confirmed:
- All shard headers are valid (magic, version, tokenizer CRC, payload size).
- Sequential shard naming with no gaps.
- Train EOS token count (180,185,493) is within 68 of the source row count for the 361 train
parquet files (180,185,425 rows). The surplus is about 0.00004% of the row count and
likely comes from documents whose tokenized content incidentally contains the EOS token ID.
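
A per-shard version of the payload-size and EOS-count checks might look like the sketch below. It is not the actual validation script: `count_eos` is a hypothetical helper, little-endian byte order is assumed, and the shard is read fully into memory for simplicity.

```python
import array
import struct
import sys

HEADER_BYTES = 1024
EOS_ID = 49131


def count_eos(shard_bytes: bytes, eos_id: int = EOS_ID) -> int:
    """Check one shard's payload size against its header and count EOS tokens."""
    # Header word 2 (byte offset 8) holds num_tokens; little-endian is assumed.
    (num_tokens,) = struct.unpack_from("<i", shard_bytes, 8)
    payload = shard_bytes[HEADER_BYTES:]
    if len(payload) != num_tokens * 2:  # uint16 means 2 bytes per token
        raise ValueError("payload size does not match num_tokens")
    tokens = array.array("H")  # 2-byte unsigned on mainstream platforms
    tokens.frombytes(payload)
    if sys.byteorder == "big":
        tokens.byteswap()  # normalise to the assumed file byte order
    return tokens.count(eos_id)


# Relative size of the reported gap between EOS count and source row count.
eos_count, source_rows = 180_185_493, 180_185_425
relative_diff = (eos_count - source_rows) / source_rows
```

Summing `count_eos` over all train shards would give the EOS total that is compared against the source row count.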
## Usage
### Download
```bash
python -m data.download_dataset fineweb-edu-shuffled
```
This downloads to `data/fineweb-edu-shuffled/train/` and `data/fineweb-edu-shuffled/val/`.
### Training Configuration
In a Daisy training YAML config:
```yaml
train_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/train"
    sequence_length: 65536
val_shards:
  - type: "fineweb_edu_shuffled"
    path: "data/fineweb-edu-shuffled/val"
    target_tokens: 1_000_000
    sequence_length: 65536
```
The data loader globs `*.bin` from the directory and reads shards sequentially.
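
One plausible reading of this behavior is sketched below: shards are listed in sorted name order, and their tokens are cut into consecutive windows of `sequence_length` that may straddle shard boundaries. `list_shards` and `iter_sequences` are hypothetical names, and the actual Daisy loader may batch, pad, or handle the trailing remainder differently.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def list_shards(directory: str) -> List[Path]:
    """Return the *.bin files in a directory, sorted by name."""
    # Zero-padded names such as 000123.bin sort in numeric order.
    return sorted(Path(directory).glob("*.bin"))


def iter_sequences(
    shards: Iterable[Iterable[int]], sequence_length: int
) -> Iterator[List[int]]:
    """Cut the concatenated shard stream into consecutive fixed-length windows."""
    buffer: List[int] = []
    for shard in shards:
        for token in shard:
            buffer.append(token)
            if len(buffer) == sequence_length:
                yield buffer
                buffer = []
    # A trailing partial window is dropped in this sketch.
```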
### Shard Range Selection
To use a subset of shards (e.g., for multi-stage training that avoids data reuse):
```yaml
path: "data/fineweb-edu-shuffled/train[000500:001000]"
```
This selects shards 000500.bin through 001000.bin (inclusive), using the range filter
supported by the Daisy data loader.
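
The bracket syntax can be illustrated with a small parser. This is a sketch of the semantics described above, not the loader's implementation; `parse_shard_range` and `shard_names` are hypothetical helpers.

```python
import re
from typing import List, Optional, Tuple

_RANGE_RE = re.compile(r"^(?P<dir>.*?)\[(?P<start>\d+):(?P<end>\d+)\]$")


def parse_shard_range(path: str) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Split 'dir[start:end]' into the directory and an inclusive (start, end)."""
    match = _RANGE_RE.match(path)
    if match is None:
        return path, None  # no range suffix: use every shard in the directory
    return match.group("dir"), (int(match.group("start")), int(match.group("end")))


def shard_names(start: int, end: int) -> List[str]:
    """File names for shards start..end, inclusive on both ends."""
    return [f"{i:06d}.bin" for i in range(start, end + 1)]
```

Because both ends are inclusive, `train[000500:001000]` covers 501 shards. Under that reading, consecutive training stages would need ranges such as `[000000:000499]` and `[000500:001000]` to avoid sharing a shard.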
## License
This dataset inherits the [ODC-BY 1.0](https://opendatacommons.org/licenses/by/1-0/) license
from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) via
[SmolLM-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).
Provider: JonathanMiddleton



