nicholasbailey87/parameter-golf-byte260
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nicholasbailey87/parameter-golf-byte260
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
pretty_name: Parameter Golf FineWeb 10B - Byte260 Tokenization
size_categories:
- 10B<n<100B
---
# Parameter Golf FineWeb 10B — Byte260 Tokenization
Byte-level tokenized version of the [FineWeb 10B](https://huggingface.co/datasets/willdepueoai/parameter-golf) dataset for the [Parameter Golf](https://github.com/KellerJordan/modded-nanogpt) competition.
The upstream Hugging Face repo (`willdepueoai/parameter-golf`) currently provides only `fineweb10B_sp1024`, in both its manifest and its dataset folders, so I whipped this up instead.
It seems to work OK, but it was vibe coded, so use it at your own risk!
## Tokenizer
**byte260**: 256 byte tokens (one per possible byte value of the UTF-8 encoded text) plus 4 special tokens (pad=0, bos=1, eos=2, unk=3). Byte values are offset by 4, so token ID = byte_value + 4.
- Vocab size: 260
- Each non-special token represents exactly 1 byte
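
The mapping above can be sketched in a few lines of Python. This is a minimal illustration of the stated rule (token ID = byte value + 4), not the code used to build the shards. The function names and the `add_bos` flag are hypothetical, and whether the shards actually prepend a BOS token to each document is not stated here.

```python
# Special token IDs as listed above; byte values are shifted up by OFFSET.
PAD, BOS, EOS, UNK = 0, 1, 2, 3
OFFSET = 4
VOCAB_SIZE = 256 + OFFSET  # 260


def encode(text: str, add_bos: bool = False) -> list[int]:
    """Map each UTF-8 byte of `text` to byte_value + 4.

    `add_bos` is a hypothetical convenience flag for this sketch.
    """
    ids = [b + OFFSET for b in text.encode("utf-8")]
    return [BOS] + ids if add_bos else ids


def decode(ids: list[int]) -> str:
    """Invert encode(); special tokens (IDs 0-3) are skipped."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    # A token window may cut a multi-byte character in half, so replace
    # invalid sequences instead of raising.
    return raw.decode("utf-8", errors="replace")


print(encode("Hi"))             # [76, 109]
print(decode(encode("héllo")))  # héllo
```

Because the tokenizer works on bytes, a non-ASCII character such as `é` becomes two tokens, so token counts equal UTF-8 byte counts rather than character counts.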
## Dataset Stats
| Split | Documents | Tokens | Shards |
|-------|-----------|--------|--------|
| Train | 15,318,808 | 47,571,635,348 | 476 |
| Val | 50,000 | 151,130,645 | 2 |
| **Total** | **15,368,808** | **47,722,765,993** | **478** |
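
As a rough sanity check, the table can be turned into average document lengths and an estimated on-disk size. This sketch assumes 2 bytes per token plus one 1024-byte header per shard, as described under File Format; the figures it prints are estimates, not measured file sizes.

```python
# Figures copied from the table above.
stats = {
    "train": {"docs": 15_318_808, "tokens": 47_571_635_348, "shards": 476},
    "val": {"docs": 50_000, "tokens": 151_130_645, "shards": 2},
}

HEADER_BYTES = 1024   # per shard, from the File Format section
BYTES_PER_TOKEN = 2   # uint16 tokens


def split_bytes(s: dict) -> int:
    """Estimated size of one split: token payload plus shard headers."""
    return s["tokens"] * BYTES_PER_TOKEN + s["shards"] * HEADER_BYTES


total_docs = sum(s["docs"] for s in stats.values())
total_tokens = sum(s["tokens"] for s in stats.values())
total_bytes = sum(split_bytes(s) for s in stats.values())

for name, s in stats.items():
    avg = s["tokens"] / s["docs"]
    print(f"{name}: {avg:,.0f} tokens/doc, ~{split_bytes(s) / 1e9:.1f} GB")
print(f"total: {total_tokens:,} tokens, ~{total_bytes / 1e9:.1f} GB")
```

Under the same assumptions, 80 training shards of about 100M tokens each come to roughly 16 GB.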
## File Format
Each shard is a binary file with a 1024-byte header followed by the token data:
- Header[0]: magic = 20240520
- Header[1]: version = 1
- Header[2]: number of tokens in the shard
- Token data: uint16, little-endian, starting at byte offset 1024
Each shard contains ~100M tokens (last shard may be smaller).
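
A shard in this layout could be read with the Python standard library roughly as follows. One assumption is made loudly: the header fields are taken to be little-endian 32-bit integers (256 fields × 4 bytes = 1024 bytes). The header field width is not stated above, but the magic number does not fit in 16 bits. `read_shard` and `write_shard` are hypothetical helper names; the writer exists only to make the example self-contained.

```python
import array
import os
import struct
import sys
import tempfile

MAGIC = 20240520
VERSION = 1
HEADER_BYTES = 1024


def write_shard(path, tokens):
    """Write `tokens` in the layout described above (test helper)."""
    # ASSUMPTION: header fields are little-endian int32; the rest is zero padding.
    header = struct.pack("<3i", MAGIC, VERSION, len(tokens))
    data = array.array("H", tokens)  # unsigned 16-bit on common platforms
    assert data.itemsize == 2
    if sys.byteorder == "big":
        data.byteswap()  # store as little-endian
    with open(path, "wb") as f:
        f.write(header.ljust(HEADER_BYTES, b"\0"))
        f.write(data.tobytes())


def read_shard(path):
    """Return the tokens of one shard as an array of unsigned 16-bit ints."""
    with open(path, "rb") as f:
        header = f.read(HEADER_BYTES)
        if len(header) != HEADER_BYTES:
            raise ValueError("file is shorter than the 1024-byte header")
        magic, version, n_tokens = struct.unpack_from("<3i", header)
        if magic != MAGIC or version != VERSION:
            raise ValueError(f"unexpected header: magic={magic}, version={version}")
        payload = f.read(n_tokens * 2)
    if len(payload) != n_tokens * 2:
        raise ValueError("token data is truncated")
    tokens = array.array("H")
    assert tokens.itemsize == 2
    tokens.frombytes(payload)
    if sys.byteorder == "big":
        tokens.byteswap()  # convert from little-endian to native order
    return tokens


# Round trip on a tiny synthetic shard: bos, "H", "i", eos.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_shard.bin")
    write_shard(demo_path, [1, 76, 109, 2])
    print(list(read_shard(demo_path)))  # [1, 76, 109, 2]
```

For full-size shards, a memory-mapped or NumPy-based reader (for example `np.fromfile(path, dtype="<u2", offset=1024)`) avoids copying roughly 200 MB per shard through Python objects.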
## Usage
Download with the Parameter Golf data loader:
```bash
python data/cached_challenge_fineweb.py --variant byte260 --train-shards 80
```
Or set the environment variables to point at this repo:
```bash
export MATCHED_FINEWEB_REPO_ID=nicholasbailey87/parameter-golf-byte260
export MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=""
python data/cached_challenge_fineweb.py --variant byte260 --train-shards 80
```
## Source
Generated from `docs_selected.jsonl` in [willdepueoai/parameter-golf](https://huggingface.co/datasets/willdepueoai/parameter-golf) using `data/download_hf_docs_and_tokenize.py` with a pure-byte tokenizer config.
Provided by:
nicholasbailey87



