Natooka/parameter-golf-sp-tokenizers

Name: Natooka/parameter-golf-sp-tokenizers
Creator: Natooka
Published: 2026-04-17 23:01:19
License: 暂无描述

Hugging Face2026-04-17 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/Natooka/parameter-golf-sp-tokenizers

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-4.0 language: - en tags: - sentencepiece - tokenizer - fineweb - parameter-golf - bpe - tokenized-corpus size_categories: - 10B<n<100B --- # Parameter Golf SP16384 — Tokenizer + Tokenized FineWeb-10B Shards SentencePiece BPE tokenizer (`vocab_size=16384`, `byte_fallback=True`) + the full FineWeb-10B corpus pre-tokenized with it. Companion artifact to the chaoscontrol submission pipeline; published to make submission-day setup frictionless — no corpus download, no re-tokenization. ## Files ### Tokenizer (root) - `fineweb_16384_bpe.model` — SentencePiece model (455 KB). - `fineweb_16384_bpe.vocab` — Human-readable vocab sidecar (185 KB). ### Tokenized shards (`shards/`) - `shards/fineweb_val_000000.bin` — val shard, 42,266,034 tokens (~84 MB, uint16). - `shards/fineweb_train_{000000..000132}.bin` — 133 train shards, 13,262,831,920 tokens (~25 GB total, uint16). Shards are flat `uint16` little-endian token streams, no header, concatenated docs with no separators. Each shard ends on a doc boundary. First 50k docs of `docs_selected.jsonl` → val, rest → train (file order, per the upstream manifest contract). ## Submission-day usage Pull the whole thing in ~5 min on a typical pod: ```python from huggingface_hub import snapshot_download local = snapshot_download( repo_id="Natooka/parameter-golf-sp-tokenizers", repo_type="dataset", local_dir="baselines/parameter_golf", # Optional — get just shards, or just the model: # allow_patterns=["shards/*.bin", "fineweb_*.model"], ) ``` After the download you have `baselines/parameter_golf/shards/*.bin` + `baselines/parameter_golf/fineweb_16384_bpe.model`. Point your training runner at those paths. ## Training configuration (tokenizer) | Setting | Value | |---|---| | `vocab_size` | 16384 | | `model_type` | BPE | | `byte_fallback` | True | | `character_coverage` | 1.0 | | `shuffle_input_sentence` | False (locks determinism) | | `sp_seed` | 1337 | | Training docs | 5,000,000 (first 5M post-val-split, per manifest convention) | | Source corpus | [`willdepueoai/parameter-golf`](https://huggingface.co/datasets/willdepueoai/parameter-golf) → `docs_selected.jsonl` | | Source revision | `9bb295ddab0e05d785b879661af7260fed5140fc` | | SentencePiece | 0.2.1 | Training was single-threaded in the BPE merge phase (SP 0.2.1 limitation — `num_threads` helps normalization only). Wall-clock on 28 vCPU Xeon 8480+ host: ~25 min SP training + ~30 min full-corpus tokenization. ## Reproducing from scratch ```bash python scripts/build_sp_shards.py \ --docs-path path/to/docs_selected.jsonl \ --vocab-size 16384 \ --sp-seed 1337 \ --sp-train-docs 5000000 \ --num-workers 28 ``` Byte-identical outputs are guaranteed within a matching `(vocab_size, sp_seed, num_workers)` triple — SP's multi-threaded merge counting can drift on tie-breaks across thread counts. Use the same `--num-workers` for cross-machine determinism, or pin to `--num-workers 1` for strict identity. ## License CC-BY 4.0 for our artifacts (tokenizer + pre-tokenized shards). Upstream `docs_selected.jsonl` subject to the Parameter Golf competition's terms (from `willdepueoai/parameter-golf`).

提供机构：

Natooka

5,000+

优质数据集

54 个

任务类型

进入经典数据集