nicholasbailey87/parameter-golf-byte260

Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nicholasbailey87/parameter-golf-byte260
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
pretty_name: Parameter Golf FineWeb 10B - Byte260 Tokenization
size_categories:
- 10B<n<100B
---

# Parameter Golf FineWeb 10B - Byte260 Tokenization

Byte-level tokenized version of the [FineWeb 10B](https://huggingface.co/datasets/willdepueoai/parameter-golf) dataset for the [Parameter Golf](https://github.com/KellerJordan/modded-nanogpt) competition.

The Hugging Face repo (willdepueoai/parameter-golf) currently provides only fineweb10B_sp1024 in both the manifest and the dataset folders, so I whipped this up instead. It seems to work OK, but use at your own risk, as it's vibe coded!

## Tokenizer

**byte260**: 256 UTF-8 byte tokens + 4 special tokens (pad=0, bos=1, eos=2, unk=3). Byte values are offset by 4, so token ID = byte_value + 4.

- Vocab size: 260
- Each non-special token represents exactly 1 byte

## Dataset Stats

| Split | Documents | Tokens | Shards |
|-------|-----------|--------|--------|
| Train | 15,318,808 | 47,571,635,348 | 476 |
| Val | 50,000 | 151,130,645 | 2 |
| **Total** | **15,368,808** | **47,722,765,993** | **478** |

## File Format

Binary shards (uint16, little-endian) with a 1024-byte header:

- Header[0]: magic = 20240520
- Header[1]: version = 1
- Header[2]: number of tokens in shard
- Followed by token data as uint16

Each shard contains ~100M tokens (last shard may be smaller).

## Usage

Download with the Parameter Golf data loader:

```bash
python data/cached_challenge_fineweb.py --variant byte260 --train-shards 80
```

Or set the environment variables to point at this repo:

```bash
export MATCHED_FINEWEB_REPO_ID=nicholasbailey87/parameter-golf-byte260
export MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=""
python data/cached_challenge_fineweb.py --variant byte260 --train-shards 80
```

## Source

Generated from [willdepueoai/parameter-golf](https://huggingface.co/datasets/willdepueoai/parameter-golf) docs_selected.jsonl using `data/download_hf_docs_and_tokenize.py` with a pure-byte tokenizer config.
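The byte260 mapping and the shard layout described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used to build the dataset: the function names (`encode`, `decode`, `write_shard`, `read_shard`) are hypothetical, and it assumes the 1024-byte header is 256 little-endian int32 values, which is consistent with the stated size and the three indexed fields but is not confirmed by the card. Whether documents are wrapped in bos/eos tokens is also not stated, so `encode` emits byte tokens only.

```python
import struct
import sys
import tempfile
from array import array
from pathlib import Path

MAGIC = 20240520
VERSION = 1
HEADER_INTS = 256   # ASSUMPTION: 256 x int32 = 1024 bytes, little-endian
BYTE_OFFSET = 4     # token ID = byte_value + 4
PAD, BOS, EOS, UNK = 0, 1, 2, 3


def encode(text: str) -> list[int]:
    """Map each UTF-8 byte to its byte260 token ID (no special tokens added)."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def decode(ids) -> str:
    """Drop special tokens (IDs 0-3) and turn the remaining IDs back into text."""
    data = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return data.decode("utf-8", errors="replace")


def write_shard(path, ids) -> None:
    """Write a header (magic, version, token count, zero padding) plus uint16 tokens."""
    header = [0] * HEADER_INTS
    header[0], header[1], header[2] = MAGIC, VERSION, len(ids)
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{HEADER_INTS}i", *header))
        f.write(struct.pack(f"<{len(ids)}H", *ids))


def read_shard(path) -> array:
    """Validate the header and return the tokens as an array of unsigned 16-bit ints."""
    with open(path, "rb") as f:
        header = struct.unpack(f"<{HEADER_INTS}i", f.read(HEADER_INTS * 4))
        if header[0] != MAGIC or header[1] != VERSION:
            raise ValueError(f"unexpected header: magic={header[0]}, version={header[1]}")
        tokens = array("H")
        tokens.fromfile(f, header[2])  # raises EOFError if the shard is truncated
    if sys.byteorder == "big":
        tokens.byteswap()  # shard data is little-endian on disk
    return tokens


if __name__ == "__main__":
    # Round trip through a temporary shard file.
    with tempfile.TemporaryDirectory() as tmp:
        shard = Path(tmp) / "demo.bin"
        write_shard(shard, encode("hi"))
        ids = read_shard(shard)
        print(list(ids), decode(ids))
```

For full ~100M-token shards, a memory-mapped reader would avoid loading the whole file at once; the `array`-based reader here is chosen only to keep the sketch dependency-free.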
Provided by:
nicholasbailey87