weights-and-wires/fineweb-6b

Hugging Face, updated 2026-01-26, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/weights-and-wires/fineweb-6b
Description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- pretraining
- web-data
- fineweb
- text-generation
size_categories:
- 1B<n<10B
---

# FineWeb-6B: First 6B Tokens

A curated subset of the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset containing the first 6 billion tokens, designed for efficient language model pre-training experiments.

## Dataset Description

This dataset contains high-quality web text suitable for pre-training small to medium-sized language models. It is particularly useful for researchers and practitioners who want to experiment with LLM pre-training without massive computational resources.

### Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total Tokens** | ~6 billion |
| **Raw Data Size** | 16.1 GB (parquet) |
| **Tokenized Size** | 11.3 GB (train) + 57 MB (val) |
| **Vocabulary Size** | 49,152 |
| **Tokenizer** | Byte-level BPE |
| **Context Length** | 2048 tokens |

## Usage

### Loading the Raw Dataset

```python
from datasets import load_dataset

# Load the parquet file
dataset = load_dataset("weights-and-wires/fineweb-6b")
```

### Loading Pre-tokenized Data

For training, you can use the pre-tokenized binary files, which load much faster:

```python
import numpy as np

# Load pre-tokenized data (paths are relative to a local copy of the repository)
train_data = np.memmap('tokenized/train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('tokenized/val.bin', dtype=np.uint16, mode='r')

print(f"Training tokens: {len(train_data):,}")
print(f"Validation tokens: {len(val_data):,}")
```

### Loading the Tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "weights-and-wires/fineweb-6b",
    subfolder="tokenized"
)

# Example usage
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Decoded: {tokenizer.decode(tokens)}")
```

## Dataset Structure

### Files

- **`fineweb-6b.parquet`**: Raw text data in parquet format (default download)
- **`tokenized/train.bin`**: Pre-tokenized training data (uint16 format)
- **`tokenized/val.bin`**: Pre-tokenized validation data (uint16 format)
- **`tokenized/tokenizer.json`**: Tokenizer vocabulary and merges
- **`tokenized/tokenizer_config.json`**: Tokenizer configuration
- **`tokenized/special_tokens_map.json`**: Special tokens mapping
- **`distillation/`**: Knowledge distillation data (see below)

### Distillation Data

The `distillation/` directory contains precomputed teacher logits from [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) for knowledge distillation:

| File | Description | Size (6B tokens) |
|------|-------------|------------------|
| `metadata.json` | Configuration and vocab info | ~1 KB |
| `train_tokens.bin` | Token IDs (uint16) | ~11.2 GB |
| `train_topk_ids.bin` | Top-128 token indices | ~1.4 GB |
| `train_topk_probs.bin` | Top-128 probabilities (float16) | ~1.4 GB |
| `val_tokens.bin` | Validation token IDs | ~56 MB |
| `val_topk_ids.bin` | Validation top-128 indices | ~7 MB |
| `val_topk_probs.bin` | Validation top-128 probs | ~7 MB |

**Loading distillation data:**

```python
import json

import numpy as np

# Load metadata
with open("distillation/metadata.json") as f:
    metadata = json.load(f)

# Load memory-mapped files; the top-k arrays hold 128 entries per row
tokens = np.memmap("distillation/train_tokens.bin", dtype=np.uint16, mode="r")
topk_ids = np.memmap("distillation/train_topk_ids.bin", dtype=np.uint16, mode="r").reshape(-1, 128)
topk_probs = np.memmap("distillation/train_topk_probs.bin", dtype=np.float16, mode="r").reshape(-1, 128)

print(f"Tokens: {len(tokens):,}")
print(f"Teacher model: {metadata['teacher_model']}")
```

### Data Fields

The parquet file contains:

- `text`: The raw text content

The binary files contain:

- Token IDs as uint16 values (0-49151)

## Training a Model

This dataset was used to train [weights-and-wires/smol-llama](https://huggingface.co/weights-and-wires/smol-llama), a 360M parameter LLaMA-style model. See that repository for training code and details.

### Example Training Loop

```python
import numpy as np
import torch

def get_batch(split='train', batch_size=64, block_size=2048):
    # Re-open the memmap on each call so the file is never fully loaded into RAM
    data = np.memmap(f'tokenized/{split}.bin', dtype=np.uint16, mode='r')
    # Random start offsets; the upper bound leaves room for the shifted targets
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    # Targets are the inputs shifted one token to the right
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    return x.cuda(), y.cuda()

# Training loop (model, optimizer, and num_steps are assumed to be defined elsewhere)
for step in range(num_steps):
    x, y = get_batch('train')
    logits, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```

## Tokenizer Details

The tokenizer is a byte-level BPE (Byte Pair Encoding) tokenizer with:

- **Vocabulary size**: 49,152 tokens
- **Special tokens**:
  - `<|endoftext|>`: End of text marker
- **Encoding**: UTF-8 byte-level
- **Trained on**: A sample of the FineWeb dataset

## Citation

If you use this dataset, please cite the original FineWeb dataset:

```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```

## License

This dataset is released under the [ODC-BY](https://opendatacommons.org/licenses/by/1-0/) license, following the original FineWeb dataset.

## Acknowledgments

- Original dataset: [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- Pre-training project: [weights-and-wires/smol-llama](https://huggingface.co/weights-and-wires/smol-llama)
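
## Sketch: Distillation Loss from Top-k Teacher Data

The card describes the top-128 teacher indices and probabilities in `distillation/` but does not show how to consume them. The following is a minimal NumPy sketch of one possible loss, not the method used by the dataset authors. It assumes that row `i` of the top-k arrays corresponds to one token position, that the student produces full-vocabulary logits, and that renormalizing the truncated teacher distribution is acceptable; the function name `topk_distillation_loss` is hypothetical.

```python
import numpy as np

def topk_distillation_loss(student_logits, topk_ids, topk_probs):
    """Cross-entropy of the student against a sparse top-k teacher distribution.

    student_logits: (N, V) float array of raw student scores
    topk_ids:       (N, K) integer array of teacher token indices
    topk_probs:     (N, K) float array of teacher probabilities
    """
    logits = student_logits.astype(np.float64)
    # Numerically stable log-softmax over the full vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Student log-probabilities at the teacher's top-k token indices.
    picked = np.take_along_axis(log_probs, topk_ids.astype(np.int64), axis=-1)
    # Assumption: renormalize the truncated teacher mass so each row sums to 1.
    teacher = topk_probs.astype(np.float64)
    teacher = teacher / teacher.sum(axis=-1, keepdims=True)
    # Average over positions of -sum_k p_teacher(k) * log p_student(k).
    return float(-(teacher * picked).sum(axis=-1).mean())

# Tiny synthetic example (random data, not read from the dataset files).
rng = np.random.default_rng(0)
demo_logits = rng.normal(size=(4, 16))
demo_ids = np.tile(np.arange(3, dtype=np.uint16), (4, 1))
demo_probs = np.full((4, 3), 1.0 / 3.0, dtype=np.float16)
demo_loss = topk_distillation_loss(demo_logits, demo_ids, demo_probs)
print(f"Demo loss: {demo_loss:.4f}")
```

With real data, slices of `topk_ids` and `topk_probs` loaded via `np.memmap` could be passed in directly; the casts inside the function convert the `uint16` and `float16` storage types before any arithmetic.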
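
## Sketch: Token Counts from File Sizes

Because the `.bin` files store one `uint16` per token, a file's size in bytes divided by two gives its token count. The sketch below works through that arithmetic for the sizes listed in the statistics table. The card does not say whether "GB" means 10^9 or 2^30 bytes, so both readings are shown as estimates; the helper name `tokens_from_bytes` is hypothetical.

```python
BYTES_PER_TOKEN = 2   # uint16 storage, as stated for the .bin files
VOCAB_SIZE = 49_152   # vocabulary size from the statistics table

def tokens_from_bytes(n_bytes: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Number of whole tokens stored in a flat binary file of n_bytes."""
    return n_bytes // bytes_per_token

# uint16 can represent 2**16 distinct IDs, enough for the stated vocabulary.
vocab_fits_uint16 = VOCAB_SIZE <= 2**16

# Two readings of the listed 11.3 GB train file size (unit is an assumption).
decimal_estimate = tokens_from_bytes(int(11.3 * 10**9))
binary_estimate = tokens_from_bytes(int(11.3 * 2**30))

print(f"Vocabulary fits in uint16: {vocab_fits_uint16}")
print(f"Reading GB as 10**9 bytes: ~{decimal_estimate / 1e9:.2f}B tokens")
print(f"Reading GB as 2**30 bytes: ~{binary_estimate / 1e9:.2f}B tokens")
```

For an exact count on a downloaded file, `os.path.getsize('tokenized/train.bin')` can be passed to the same helper, or `len()` can be taken of the memory-mapped array.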
Provider:
weights-and-wires