weights-and-wires/fineweb-6b
Source: Hugging Face. Updated 2026-01-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/weights-and-wires/fineweb-6b
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- pretraining
- web-data
- fineweb
- text-generation
size_categories:
- 1B<n<10B
---
# FineWeb-6B: First 6B Tokens
A curated subset of the [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset containing the first 6 billion tokens, designed for efficient language model pre-training experiments.
## Dataset Description
This dataset contains high-quality web text data suitable for pre-training small to medium-sized language models. It's particularly useful for researchers and practitioners who want to experiment with LLM pre-training without requiring massive computational resources.
### Dataset Statistics
| Metric | Value |
|--------|-------|
| **Total Tokens** | ~6 billion |
| **Raw Data Size** | 16.1 GB (parquet) |
| **Tokenized Size** | 11.3 GB (train) + 57 MB (val) |
| **Vocabulary Size** | 49,152 |
| **Tokenizer** | Byte-level BPE |
| **Context Length** | 2048 tokens |
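
As a rough sanity check on the table above, the token count can be estimated from the tokenized file size, because each token ID is stored as a 2-byte uint16. The sketch below is illustrative arithmetic only. The card does not say whether "GB" means 10^9 or 2^30 bytes, so both readings are shown, and the helper name `tokens_from_size` is hypothetical.

```python
import numpy as np

BYTES_PER_TOKEN = np.dtype(np.uint16).itemsize  # 2 bytes per token ID
VOCAB_SIZE = 49_152

def tokens_from_size(size: float, unit_bytes: int) -> int:
    """Estimate how many uint16 tokens fit in a file of `size` units."""
    return int(size * unit_bytes) // BYTES_PER_TOKEN

# Every token ID (0..49151) must fit in uint16 (max 65535).
assert VOCAB_SIZE - 1 <= np.iinfo(np.uint16).max

train_decimal = tokens_from_size(11.3, 10**9)  # if GB means 10^9 bytes
train_binary = tokens_from_size(11.3, 2**30)   # if GB means 2^30 bytes
print(f"train tokens (decimal reading): {train_decimal:,}")
print(f"train tokens (binary reading):  {train_binary:,}")
```

Under the binary reading the estimate lands slightly above 6 billion tokens; under the decimal reading it is somewhat below.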
## Usage
### Loading the Raw Dataset
```python
from datasets import load_dataset
# Load the parquet file
dataset = load_dataset("weights-and-wires/fineweb-6b")
```
### Loading Pre-tokenized Data
For training, you can use the pre-tokenized binary files, which can be memory-mapped directly and need no tokenization at load time:
```python
import numpy as np
# Load pre-tokenized training data
train_data = np.memmap('tokenized/train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('tokenized/val.bin', dtype=np.uint16, mode='r')
print(f"Training tokens: {len(train_data):,}")
print(f"Validation tokens: {len(val_data):,}")
```
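
The `.bin` files are assumed here to be flat, headerless arrays of uint16 token IDs, which is what the `np.memmap` calls above imply. The following sketch writes a small synthetic file in that layout, reads it back, and views it as non-overlapping blocks of the context length. The file name and the tiny sizes are made up for illustration.

```python
import os
import tempfile
import numpy as np

block_size = 8  # stand-in for the 2048-token context length

# Write 20 synthetic token IDs as raw uint16, with no header.
ids = np.arange(20, dtype=np.uint16)
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
ids.tofile(path)

# Memory-map the file; data is only read from disk when it is indexed.
data = np.memmap(path, dtype=np.uint16, mode="r")
n_tokens = len(data)
assert n_tokens == os.path.getsize(path) // 2

# Drop the ragged tail and view the stream as (n_blocks, block_size).
n_blocks = n_tokens // block_size
blocks = data[: n_blocks * block_size].reshape(n_blocks, block_size)
print(blocks.shape)  # (2, 8)
```

The same file-size division gives the token count of the real files without loading them into memory.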
### Loading the Tokenizer
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "weights-and-wires/fineweb-6b",
    subfolder="tokenized"
)
# Example usage
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Decoded: {tokenizer.decode(tokens)}")
```
## Dataset Structure
### Files
- **`fineweb-6b.parquet`**: Raw text data in parquet format (default download)
- **`tokenized/train.bin`**: Pre-tokenized training data (uint16 format)
- **`tokenized/val.bin`**: Pre-tokenized validation data (uint16 format)
- **`tokenized/tokenizer.json`**: Tokenizer vocabulary and merges
- **`tokenized/tokenizer_config.json`**: Tokenizer configuration
- **`tokenized/special_tokens_map.json`**: Special tokens mapping
- **`distillation/`**: Knowledge distillation data (see below)
### Distillation Data
The `distillation/` directory contains precomputed teacher logits from [SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) for knowledge distillation:
| File | Description | Size (6B tokens) |
|------|-------------|------------------|
| `metadata.json` | Configuration and vocab info | ~1 KB |
| `train_tokens.bin` | Token IDs (uint16) | ~11.2 GB |
| `train_topk_ids.bin` | Top-128 token indices | ~1.4 GB |
| `train_topk_probs.bin` | Top-128 probabilities (float16) | ~1.4 GB |
| `val_tokens.bin` | Validation token IDs | ~56 MB |
| `val_topk_ids.bin` | Validation top-128 indices | ~7 MB |
| `val_topk_probs.bin` | Validation top-128 probs | ~7 MB |
**Loading distillation data:**
```python
import numpy as np
import json
# Load metadata
with open("distillation/metadata.json") as f:
    metadata = json.load(f)
# Load memory-mapped files
tokens = np.memmap("distillation/train_tokens.bin", dtype=np.uint16, mode="r")
topk_ids = np.memmap("distillation/train_topk_ids.bin", dtype=np.uint16, mode="r").reshape(-1, 128)
topk_probs = np.memmap("distillation/train_topk_probs.bin", dtype=np.float16, mode="r").reshape(-1, 128)
print(f"Tokens: {len(tokens):,}")
print(f"Teacher model: {metadata['teacher_model']}")
```
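
One common way to use stored top-k teacher probabilities is a cross-entropy between the teacher's top-k distribution and the student's log-probabilities at the same token IDs. The sketch below illustrates that idea in NumPy on synthetic arrays. It is not the loss used for any model mentioned in this card, and the function name `topk_distill_loss`, the renormalization of the top-k probabilities, and the array shapes are all assumptions.

```python
import numpy as np

def topk_distill_loss(student_logits, topk_ids, topk_probs):
    """Mean cross-entropy of student log-probs against teacher top-k probs.

    student_logits: (N, V) float array
    topk_ids:       (N, K) integer token IDs
    topk_probs:     (N, K) teacher probabilities for those IDs
    """
    logits = student_logits.astype(np.float64)
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the student's log-probs at the teacher's top-k token IDs.
    picked = np.take_along_axis(log_probs, topk_ids.astype(np.int64), axis=-1)
    # Renormalize teacher probs so each row sums to 1 (an assumption).
    p = topk_probs.astype(np.float64)
    p = p / p.sum(axis=-1, keepdims=True)
    return float(-(p * picked).sum(axis=-1).mean())

# Synthetic example: 4 positions, vocabulary of 10, top-3 teacher entries.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 10)).astype(np.float32)
topk_ids = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]], dtype=np.uint16)
topk_probs = np.array([[0.7, 0.2, 0.1]] * 4, dtype=np.float16)
loss = topk_distill_loss(student_logits, topk_ids, topk_probs)
print(f"distillation loss: {loss:.4f}")
```

How rows of the top-k files line up with positions in `train_tokens.bin` is not stated here, so check `metadata.json` before pairing them.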
### Data Fields
The parquet file contains:
- `text`: The raw text content
The binary files contain:
- Token IDs as uint16 values (0-49151)
## Training a Model
This dataset was used to train [weights-and-wires/smol-llama](https://huggingface.co/weights-and-wires/smol-llama), a 360M-parameter LLaMA-style model. See that repository for training code and details.
### Example Training Loop
```python
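import numpy as np
import torch

def get_batch(split='train', batch_size=64, block_size=2048):
    # Memory-map the split; only the sampled slices are read from disk
    data = np.memmap(f'tokenized/{split}.bin', dtype=np.uint16, mode='r')
    # Random start offsets, leaving room for the one-token target shift
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    return x.cuda(), y.cuda()

# Training loop: `model`, `optimizer`, and `num_steps` must be defined by the caller,
# and `model(x, y)` is assumed to return `(logits, loss)`
for step in range(num_steps):
    x, y = get_batch('train')
    logits, loss = model(x, y)
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()
    optimizer.step()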
```
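
The slicing in `get_batch` builds next-token targets: `y` is `x` shifted by one position in the token stream. A tiny NumPy sketch makes the alignment visible, using fixed start offsets chosen here for illustration instead of random sampling.

```python
import numpy as np

data = np.arange(100, 120, dtype=np.uint16)  # toy token stream
block_size = 4
starts = [0, 5]  # fixed offsets in place of torch.randint

# Inputs start at each offset; targets start one token later.
x = np.stack([data[i : i + block_size] for i in starts]).astype(np.int64)
y = np.stack([data[i + 1 : i + 1 + block_size] for i in starts]).astype(np.int64)

print(x)  # rows begin at tokens 100 and 105
print(y)  # rows begin at tokens 101 and 106

# Each target is the token that follows the corresponding input token.
assert np.array_equal(x[:, 1:], y[:, :-1])
```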
## Tokenizer Details
The tokenizer is a byte-level BPE (Byte Pair Encoding) tokenizer with:
- **Vocabulary size**: 49,152 tokens
- **Special tokens**:
  - `<|endoftext|>`: End of text marker
- **Encoding**: UTF-8 byte-level
- **Trained on**: A sample of the FineWeb dataset
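
To illustrate what "byte-level BPE" means in general, the sketch below starts from raw UTF-8 bytes (IDs 0 to 255) and applies a single merge of the most frequent adjacent pair. It shows the generic technique only, not this tokenizer's actual merges or training procedure; the helper names `most_frequent_pair` and `merge` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of symbols."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level start: any UTF-8 string maps to IDs in 0..255,
# so no input is ever out of vocabulary.
base = list("héllo hello".encode("utf-8"))
pair = most_frequent_pair(base)
merged = merge(base, pair, 256)  # 256 is the first ID after the byte range
print(len(base), "->", len(merged))
```

Repeating the merge step with fresh IDs grows the vocabulary one entry at a time, which is how a BPE vocabulary of a chosen size is typically built.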
## Citation
If you use this dataset, please cite the original FineWeb dataset:
```bibtex
@inproceedings{
penedo2024the,
title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## License
This dataset is released under the [ODC-BY](https://opendatacommons.org/licenses/by/1-0/) license, following the original FineWeb dataset.
## Acknowledgments
- Original dataset: [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- Pre-training project: [weights-and-wires/smol-llama](https://huggingface.co/weights-and-wires/smol-llama)
Provider:
weights-and-wires



