StentorLabs/BPE-vs.-Unigram-Tokenization-at-Constrained-Vocabulary-Sizes
Source: Hugging Face | Updated: 2026-04-08 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/StentorLabs/BPE-vs.-Unigram-Tokenization-at-Constrained-Vocabulary-Sizes
Resource description:
---
language:
- en
pretty_name: "BPE vs Unigram Tokenization at Constrained Vocabulary Sizes (4K–16K): A Systematic Review for English-Centric Small Language Models"
tags:
- tokenization
- byte-pair-encoding
- unigram
- sentencepiece
- small-language-models
- nlp
- subword-tokenization
- vocabulary
- fertility
- huggingface
- transformers
- text
license: cc-by-4.0
task_categories:
- text-generation
size_categories:
- n<1K
---
# BPE vs Unigram Tokenization at Constrained Vocabulary Sizes (4K–16K)
## A Systematic Review for English-Centric Small Language Models
**Author:** Kai Izumoto — StentorLabs Independent Research
**Date:** April 2026
**Contact:** StentorLabs@gmail.com
---
## Overview
This dataset repository hosts an informal technical review paper examining the choice between **Byte-Pair Encoding (BPE)** and the **Unigram Language Model** tokenization algorithm for English-centric small language models (SLMs) trained at vocabulary sizes of **4,000 to 16,000 tokens**.
This is an independent research document, not a peer-reviewed publication. Feedback is welcome.
---
## Files Included
| File | Description |
|------|-------------|
| `BPE_vs_Unigram.md` | Markdown version of the full paper |
| `BPE_vs_Unigram.pdf` | PDF version of the full paper |
| `BPE_vs_Unigram.docx` | Original Word document |
---
## What This Paper Covers
The paper synthesizes findings from over fifty sources (predominantly January 2025 – April 2026) across the following topics:
- **Tokenization fertility** — mean tokens per word — and its downstream effects on attention cost, context window use, and model performance
- **Empirical comparisons** of BPE vs Unigram at 4K, 8K, and 16K vocabulary sizes, including ACL BabyLM 2025 findings
- **Why the Unigram fertility advantage is context-dependent** — robust in multilingual/morphologically rich settings, but narrows or reverses for English-only corpora at small vocab sizes
- **Practical constraints**: SentencePiece Unigram's RAM requirements during training, numerical instabilities, and HuggingFace ecosystem compatibility
- **Parameter efficiency** in sub-50M parameter models and the outsized role of embedding table size
- **Emerging BPE extensions**: SuperBPE, Length-MAX, entropy-driven pre-tokenization
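Two quantities from the list above, fertility and the embedding table's share of parameters, can be made concrete with a minimal sketch. The function names, the toy tokenizer, and the model dimensions (`d_model=512`, 50M total parameters) are illustrative assumptions, not values taken from the paper. Fertility is computed here by tokenizing each whitespace-delimited word on its own, which only approximates how a real tokenizer handles word boundaries.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Mean number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in input")
    return n_tokens / n_words


def embedding_share(vocab_size: int, d_model: int, total_params: int) -> float:
    """Fraction of all parameters held by a single (tied) embedding table."""
    return vocab_size * d_model / total_params


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for a trained tokenizer: fixed 4-character chunks.
    return [word[i:i + 4] for i in range(0, len(word), 4)]


# "tokenization" -> 3 chunks, "matters" -> 2 chunks, so 5 tokens over 2 words.
print(fertility(["tokenization matters"], toy_tokenize))  # 2.5

# Assumed model: d_model = 512, 50M parameters in total.
for vocab_size in (4_000, 8_000, 16_000):
    print(vocab_size, embedding_share(vocab_size, 512, 50_000_000))
```

Under these assumed dimensions, moving from a 4K to a 16K vocabulary raises the embedding table's share from about 4% to about 16% of the parameter budget. To measure fertility for a real tokenizer, pass its encode function in place of `toy_tokenize`.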
---
## Key Findings
**On downstream performance:** Results are mixed and architecture-dependent. LSTM models at 8K slightly favor BPE; transformer models at 8K slightly favor Unigram. Neither algorithm clearly dominates across vocabulary sizes.
**On practical constraints:** BPE wins clearly — lower RAM to train, deterministic output, superior native HuggingFace integration, no extra dependencies.
**Overall recommendation:** For English-only SLM development on FineWeb-type corpora at 4K–16K vocabulary, BPE as implemented in the HuggingFace `tokenizers` library is preferable — primarily on practical grounds, with directional empirical support.
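As a rough illustration of this recommendation, the sketch below trains a byte-level BPE tokenizer with the HuggingFace `tokenizers` library at an 8K vocabulary target. The tiny in-memory corpus, the `<|endoftext|>` special token, and the byte-level pre-tokenization are assumptions made for the example, not the paper's configuration. A real run would stream a FineWeb-type corpus instead.

```python
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Hypothetical stand-in corpus; replace with an iterator over real documents.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "small language models need compact vocabularies",
    "tokenization choices affect context window use",
] * 50

tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization means any input string can be encoded.
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=8000,  # upper bound; a tiny corpus yields fewer tokens
    special_tokens=["<|endoftext|>"],  # assumed special token
    initial_alphabet=ByteLevel.alphabet(),  # include all 256 byte symbols
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("the quick brown fox")
print(encoding.tokens)
print(tokenizer.get_vocab_size())
```

The trained object can be written to disk with `tokenizer.save(...)` and later wrapped for use with `transformers`, for example via `PreTrainedTokenizerFast(tokenizer_object=tokenizer)`.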
---
## Intended Audience
- Researchers and practitioners training small language models on constrained compute budgets
- Anyone choosing between BPE and SentencePiece Unigram for an English-only tokenizer
- People working in the 4K–16K vocabulary range specifically
---
## Keywords
subword tokenization, byte-pair encoding, unigram language model, SentencePiece, small language models, vocabulary size, fertility, token efficiency, HuggingFace, FineWeb
---
## Citation
If you find this useful, you can cite it informally as:
```
Izumoto, K. (2026). Byte-Pair Encoding versus Unigram Language Model Tokenization
at Constrained Vocabulary Sizes (4K–16K): A Systematic Review for English-Centric
Small Language Models. StentorLabs Independent Research.
Available at: https://huggingface.co/datasets/[your-username]/bpe-vs-unigram-review
```
---
## Disclaimer
This is an informal technical review, not a peer-reviewed paper. The author has made every effort to accurately represent the cited literature as of April 2026. Corrections and feedback are welcome via the community tab.
Provider:
StentorLabs



