StentorLabs/BPE-vs.-Unigram-Tokenization-at-Constrained-Vocabulary-Sizes

Name: StentorLabs/BPE-vs.-Unigram-Tokenization-at-Constrained-Vocabulary-Sizes
Creator: StentorLabs
Published: 2026-04-08 04:25:28
License: No description provided

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/StentorLabs/BPE-vs.-Unigram-Tokenization-at-Constrained-Vocabulary-Sizes

Description:

---
language:
- en
pretty_name: "BPE vs Unigram Tokenization at Constrained Vocabulary Sizes (4K–16K): A Systematic Review for English-Centric Small Language Models"
tags:
- tokenization
- byte-pair-encoding
- unigram
- sentencepiece
- small-language-models
- nlp
- subword-tokenization
- vocabulary
- fertility
- huggingface
- transformers
- text
license: cc-by-4.0
task_categories:
- text-generation
size_categories:
- n<1K
---

# BPE vs Unigram Tokenization at Constrained Vocabulary Sizes (4K–16K)

## A Systematic Review for English-Centric Small Language Models

**Author:** Kai Izumoto, StentorLabs Independent Research

**Date:** April 2026

**Contact:** StentorLabs@gmail.com

---

## Overview

This dataset repository hosts an informal technical review paper examining the choice between **Byte-Pair Encoding (BPE)** and the **Unigram Language Model** tokenization algorithm for English-centric small language models (SLMs) trained at vocabulary sizes of **4,000 to 16,000 tokens**.

This is an independent research document, not a peer-reviewed publication. Feedback is welcome.


---

## Files Included

| File | Description |
|------|-------------|
| `BPE_vs_Unigram.md` | Markdown version of the full paper |
| `BPE_vs_Unigram.pdf` | PDF version of the full paper |
| `BPE_vs_Unigram.docx` | Original Word document |

---

## What This Paper Covers

The paper synthesises findings from over fifty sources (predominantly January 2025 – April 2026) across the following topics:

- **Tokenization fertility** (mean tokens per word) and its downstream effects on attention cost, context window use, and model performance
- **Empirical comparisons** of BPE vs Unigram at 4K, 8K, and 16K vocabulary sizes, including ACL BabyLM 2025 findings
- **Why the Unigram fertility advantage is context-dependent**: robust in multilingual/morphologically rich settings, but narrows or reverses for English-only corpora at small vocab sizes
- **Practical constraints**: SentencePiece Unigram's RAM requirements during training, numerical instabilities, and HuggingFace ecosystem compatibility
- **Parameter efficiency** in sub-50M parameter models and the outsized role of embedding table size
- **Emerging BPE extensions**: SuperBPE, Length-MAX, entropy-driven pre-tokenization

---

## Key Findings

**On downstream performance:** Mixed and architecture-dependent. LSTM models at 8K slightly favor BPE; transformer models at 8K slightly favor Unigram. Neither algorithm dominates clearly across vocabulary sizes.

**On practical constraints:** BPE wins clearly: lower RAM to train, deterministic output, superior native HuggingFace integration, no extra dependencies.

**Overall recommendation:** For English-only SLM development on FineWeb-type corpora at 4K–16K vocabulary, BPE as implemented in the HuggingFace `tokenizers` library is preferable, primarily on practical grounds, with directional empirical support.

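
Fertility, described above as mean tokens per word, can be measured with a few lines of Python. The sketch below is illustrative only and is not taken from the paper: `fertility` and `toy_tokenize` are hypothetical names, words are split on whitespace, and each word is tokenized in isolation, which simplifies how real subword tokenizers treat word boundaries.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Mean number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in input")
    return n_tokens / n_words


def toy_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut a word into chunks of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


sample = ["small models need small vocabularies"]
# 5 words split into 2 + 2 + 1 + 2 + 3 = 10 chunks
print(fertility(sample, toy_tokenize))  # 2.0
```

Any callable that maps a word to a list of tokens can replace `toy_tokenize`. For example, assuming `tok` is an already trained `Tokenizer` from the HuggingFace `tokenizers` library, `lambda w: tok.encode(w).tokens` would give a per-word fertility estimate for that tokenizer.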

---

## Intended Audience

- Researchers and practitioners training small language models on constrained compute budgets
- Anyone choosing between BPE and SentencePiece Unigram for an English-only tokenizer
- People working in the 4K–16K vocabulary range specifically

---

## Keywords

subword tokenization, byte-pair encoding, unigram language model, SentencePiece, small language models, vocabulary size, fertility, token efficiency, HuggingFace, FineWeb

---

## Citation

If you find this useful, you can cite it informally as:

```
Izumoto, K. (2026). Byte-Pair Encoding versus Unigram Language Model Tokenization at Constrained Vocabulary Sizes (4K–16K): A Systematic Review for English-Centric Small Language Models. StentorLabs Independent Research.
Available at: https://huggingface.co/datasets/[your-username]/bpe-vs-unigram-review
```

---

## Disclaimer

This is an informal technical review, not a peer-reviewed paper. The author has made every effort to accurately represent the cited literature as of April 2026. Corrections and feedback are welcome via the community tab.
