anisoleai/fineweb-tokenized
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/anisoleai/fineweb-tokenized
Resource description:
---
license: odc-by
language:
- en
source_datasets:
- HuggingFaceFW/fineweb
configs:
- config_name: default
  data_files:
  - split: shard
    path:
    - data_1/shard-*
    - data_2/shard-*
    - data_3/shard-*
    - data_4/shard-*
    - data_5/shard-*
    - data_6/shard-*
    - data_7/shard-*
    - data_8/shard-*
    - data_9/shard-*
    - data_10/shard-*
    - data_11/shard-*
    - data_12/shard-*
    - data_13/shard-*
    - data_14/shard-*
    - data_15/shard-*
    - data_16/shard-*
    - data_17/shard-*
    - data_18/shard-*
    - data_19/shard-*
    - data_20/shard-*
---
# FineWeb Tokenized Corpus (AnisoleAI)
## Overview
This repository provides a large-scale **pre-tokenized version of the FineWeb dataset** designed for efficient training of language models.
The dataset contains text from the **FineWeb corpus** that has been tokenized using a **SentencePiece tokenizer**.
Tokens are stored in a compact **`uint16` format** (2 bytes per token, so token IDs range from 0 to 65,535) for efficient storage and high-throughput training.
Each dataset record contains a **flat array of token IDs** representing a continuous tokenized text sequence.
This format allows:
- fast loading
- minimal memory overhead
- efficient distributed training
- direct compatibility with LLM training pipelines
No semantic modifications were made to the original FineWeb dataset.
The text was only **tokenized and serialized into shard files**.
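To make the storage claim concrete, here is a minimal sketch comparing the in-memory size of the same token IDs stored as `uint16` and as `int64`. It assumes NumPy, which this card does not name, and the array is synthetic: the modulus of 50,000 is an arbitrary illustrative value, not the tokenizer's actual vocabulary size.

```python
import numpy as np

# Synthetic stand-in for one record's token IDs (not real corpus data).
# The modulus 50_000 is an arbitrary illustrative bound, not the real vocabulary size.
token_ids = np.arange(1_000_000, dtype=np.int64) % 50_000

compact = token_ids.astype(np.uint16)  # 2 bytes per token
wide = token_ids.astype(np.int64)      # 8 bytes per token

print("uint16 bytes:", compact.nbytes)
print("int64 bytes:", wide.nbytes)

# uint16 represents IDs 0..65535, so the vocabulary cannot exceed 65,536 entries.
max_id = int(np.iinfo(np.uint16).max)
print("largest representable ID:", max_id)
```

The narrower type is lossless here only because every ID fits below the `uint16` limit.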
---
# Dataset Structure
The dataset is organized into multiple shard directories:
```
data_1/shard-00000.parquet
data_1/shard-00001.parquet
...
data_2/shard-00000.parquet
...
...
data_20/shard-XXXXX.parquet
```
Each shard contains:
```
token_ids: uint16[]
```
Each record stores a contiguous tokenized segment that can be used directly for model training.
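One common way to consume flat token arrays is to concatenate them and cut the result into fixed-length training blocks. The sketch below illustrates that idea with NumPy on synthetic records. The function name `pack_into_blocks` and the block size are hypothetical, and whether records in this corpus contain document separators is not stated in this card, so treat the concatenation step as an assumption.

```python
import numpy as np

def pack_into_blocks(records, block_size):
    """Concatenate token arrays and cut them into rows of `block_size` tokens.

    Trailing tokens that do not fill a whole block are dropped.
    """
    flat = np.concatenate([np.asarray(r, dtype=np.uint16) for r in records])
    n_blocks = len(flat) // block_size
    return flat[: n_blocks * block_size].reshape(n_blocks, block_size)

# Synthetic records standing in for dataset[i]["token_ids"].
records = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]]
blocks = pack_into_blocks(records, block_size=4)
print(blocks)
```

Training frameworks often expect wider integer types for embedding lookups, so a cast such as `blocks.astype(np.int64)` at batch time may be needed.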
---
# Loading the Dataset
You can load the dataset with the Hugging Face `datasets` library. The only split defined in the repository configuration is named `shard`.
```python
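from datasets import load_dataset

# "shard" is the only split defined in this repository's configuration.
dataset = load_dataset(
    "anisoleai/fineweb-tokenized",
    split="shard",
)

# Each record holds a flat sequence of token IDs.
sample = dataset[0]["token_ids"]
print("Number of tokens:", len(sample))
```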
```
The dataset supports:
- streaming
- distributed loading
- partial downloads
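As a sketch of the streaming option, the snippet below passes `streaming=True` to `load_dataset`, so data files are fetched as you iterate instead of being downloaded up front. The helper names `open_stream` and `count_tokens` are hypothetical, and the record limit is an arbitrary example value.

```python
from itertools import islice

def open_stream():
    """Open the corpus lazily; data files are fetched during iteration."""
    # Imported here so the helper below stays usable without the library installed.
    from datasets import load_dataset
    return load_dataset(
        "anisoleai/fineweb-tokenized",
        split="shard",
        streaming=True,
    )

def count_tokens(records, limit):
    """Total token count over the first `limit` records of any record iterable."""
    return sum(len(r["token_ids"]) for r in islice(records, limit))

# Usage (requires network access):
#   print(count_tokens(open_stream(), limit=100))
```

Because `count_tokens` accepts any iterable of records, it behaves the same on a streamed dataset and on an in-memory list.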
---
# Loading the Tokenizer
The tokenizer used to generate the corpus is included in this repository.
```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download tokenizer.model from the dataset repository (cached locally).
model_path = hf_hub_download(
    repo_id="anisoleai/fineweb-tokenized",
    filename="tokenizer.model",
    repo_type="dataset",
)

sp = spm.SentencePieceProcessor(model_file=model_path)
print("Vocabulary size:", sp.get_piece_size())

# Decode a few arbitrary token IDs back to text.
print(sp.decode([1, 10, 20, 30]))
```
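To inspect what a record contains, you can decode its token IDs back to text. The helper below is a small sketch: `decode_record` is a hypothetical name, and converting each ID to a plain Python `int` is a conservative step in case the IDs arrive as NumPy `uint16` scalars, which the tokenizer binding may not treat as integers.

```python
def decode_record(processor, token_ids):
    """Decode one record's IDs with any object exposing `decode(list_of_ints)`."""
    # Convert NumPy scalars (for example uint16) to plain Python ints first.
    ids = [int(t) for t in token_ids]
    return processor.decode(ids)

# Usage with the `sp` and `dataset` objects from the snippets above
# (requires the downloaded tokenizer and data):
#   text = decode_record(sp, dataset[0]["token_ids"][:50])
```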
---
# Intended Use
This dataset is intended for:
- large language model pretraining
- tokenizer benchmarking
- distributed LLM training pipelines
- academic AI research
- commercial AI development
The shard-based structure allows scalable multi-worker training pipelines.
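One simple way to use the shard layout for multi-worker training is to give each worker a disjoint subset of shard files. The sketch below uses round-robin assignment. The function name `shards_for_worker` and the file list are hypothetical; the list only mirrors the `data_N/shard-NNNNN.parquet` naming shown in this card.

```python
def shards_for_worker(shard_paths, rank, world_size):
    """Return the shard paths handled by worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every worker sees the same order, then take every world_size-th path.
    return sorted(shard_paths)[rank::world_size]

# Hypothetical file list following the data_N/shard-NNNNN.parquet layout.
paths = [f"data_1/shard-{i:05d}.parquet" for i in range(5)]
for rank in range(2):
    print(rank, shards_for_worker(paths, rank, world_size=2))
```

Every shard is assigned to exactly one worker, so no tokens are seen twice within a pass, although workers may receive slightly different shard counts.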
---
# Source Dataset
Original dataset:
**FineWeb**
https://huggingface.co/datasets/HuggingFaceFW/fineweb
FineWeb is a large-scale filtered web corpus designed for training language models.
---
# License
This dataset follows the license of the original dataset:
**Open Data Commons Attribution License (ODC-BY) v1.0**
https://opendatacommons.org/licenses/by/1-0/
---
# Attribution
If you use this dataset, please attribute:
- the creators of the FineWeb dataset
- **AnisoleAI** for the tokenization pipeline and dataset preparation
---
# Notes
- The dataset contains **token IDs only**.
- Original raw text is **not included**.
- Token IDs correspond to the included **SentencePiece tokenizer**.
Provider:
anisoleai



