
anisoleai/fineweb-tokenized

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/anisoleai/fineweb-tokenized
Resource description:
---
license: odc-by
language:
- en
source_datasets:
- HuggingFaceFW/fineweb
configs:
- config_name: default
  data_files:
  - split: shard
    path:
    - data_1/shard-*
    - data_2/shard-*
    - data_3/shard-*
    - data_4/shard-*
    - data_5/shard-*
    - data_6/shard-*
    - data_7/shard-*
    - data_8/shard-*
    - data_9/shard-*
    - data_10/shard-*
    - data_11/shard-*
    - data_12/shard-*
    - data_13/shard-*
    - data_14/shard-*
    - data_15/shard-*
    - data_16/shard-*
    - data_17/shard-*
    - data_18/shard-*
    - data_19/shard-*
    - data_20/shard-*
---

# FineWeb Tokenized Corpus (AnisoleAI)

## Overview

This repository provides a large-scale **pre-tokenized version of the FineWeb dataset** designed for efficient training of language models.

The dataset contains text from the **FineWeb corpus** that has been tokenized using a **SentencePiece tokenizer**. Tokens are stored in a compact **`uint16` format** (two bytes per token) for efficient storage and high-throughput training.

Each dataset record contains a **flat array of token IDs** representing a continuous tokenized text sequence.

This format allows:

- fast loading
- minimal memory overhead
- efficient distributed training
- direct compatibility with LLM training pipelines

No semantic modifications were made to the original FineWeb dataset. The text was only **tokenized and serialized into shard files**.

---

# Dataset Structure

The dataset is organized into multiple shard directories:

```
data_1/shard-00000.parquet
data_1/shard-00001.parquet
...
data_2/shard-00000.parquet
...
...
data_20/shard-XXXXX.parquet
```

Each shard contains a single column:

```
token_ids: uint16[]
```

Each record stores a contiguous tokenized segment that can be used directly for model training.

---

# Loading the Dataset

You can load the dataset using the Hugging Face `datasets` library.

```python
from datasets import load_dataset

# The only split defined in the dataset config is named "shard".
dataset = load_dataset(
    "anisoleai/fineweb-tokenized",
    split="shard"
)

# Each record holds one flat list of token IDs.
sample = dataset[0]["token_ids"]
print("Number of tokens:", len(sample))
```

The dataset supports:

- streaming
- distributed loading
- partial downloads

---

# Loading the Tokenizer

The tokenizer used to generate the corpus is included in this repository.

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Fetch the tokenizer file from the dataset repository.
model_path = hf_hub_download(
    repo_id="anisoleai/fineweb-tokenized",
    filename="tokenizer.model",
    repo_type="dataset"
)

sp = spm.SentencePieceProcessor(model_file=model_path)

print("Vocabulary size:", sp.get_piece_size())
# Decode a few arbitrary token IDs back into text.
print(sp.decode([1, 10, 20, 30]))
```

---

# Intended Use

This dataset is intended for:

- large language model pretraining
- tokenizer benchmarking
- distributed LLM training pipelines
- academic AI research
- commercial AI development

The shard-based structure allows scalable multi-worker training pipelines.

---

# Source Dataset

Original dataset: **FineWeb**

https://huggingface.co/datasets/HuggingFaceFW/fineweb

FineWeb is a large-scale filtered web corpus designed for training language models.

---

# License

This dataset follows the license of the original dataset:

**Open Data Commons Attribution License (ODC-BY) v1.0**

https://opendatacommons.org/licenses/by/1-0/

---

# Attribution

If you use this dataset, please attribute:

- the creators of the FineWeb dataset
- **AnisoleAI** for the tokenization pipeline and dataset preparation

---

# Notes

- The dataset contains **token IDs only**.
- Original raw text is **not included**.
- Token IDs correspond to the included **SentencePiece tokenizer**.
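
---

# Example: Packing Records into Training Blocks

Because each record is a flat array of token IDs, one plausible way to feed a next-token-prediction model is to concatenate records and cut them into fixed-length blocks. The sketch below is illustrative only and is not part of the original repository: the function name `pack_tokens` and the parameter `block_len` are hypothetical, and it assumes that concatenating tokens across record boundaries is acceptable for your training setup. It uses only the Python standard library, with `array("H")` standing in for the `uint16` storage format.

```python
from array import array
from typing import Iterable, Iterator, List, Sequence, Tuple


def pack_tokens(
    records: Iterable[Sequence[int]], block_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) pairs of length block_len from flat token records.

    Hypothetical helper: targets are the inputs shifted by one position,
    which is the usual layout for next-token prediction.
    """
    if block_len < 1:
        raise ValueError("block_len must be at least 1")

    # "H" is an unsigned 16-bit integer on common platforms, mirroring uint16.
    buffer = array("H")
    for record in records:
        # extend() raises OverflowError if an ID does not fit in 16 bits.
        buffer.extend(record)
        while len(buffer) >= block_len + 1:
            window = buffer[: block_len + 1]
            yield list(window[:-1]), list(window[1:])
            # Keep the last token of the window as the first input of the next block.
            del buffer[:block_len]
    # Tokens left in the buffer at the end are dropped (assumption of this sketch).


# Toy records standing in for dataset[i]["token_ids"].
toy_records = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
for inputs, targets in pack_tokens(toy_records, block_len=3):
    print(inputs, targets)
```

With real data, `records` could be an iterator over the `token_ids` column of the loaded dataset. Whether to insert a separator token between records depends on how the corpus was tokenized, which this page does not specify.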
Provider:
anisoleai