CraneAILabs/luganda-tokenizer-evaluation
Source: Hugging Face. Updated 2026-04-08; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/CraneAILabs/luganda-tokenizer-evaluation
---
license: apache-2.0
language:
- lug
- en
tags:
- luganda
- tokenizer
- evaluation
- low-resource
- african-languages
- morphology
- bpc
- fertility
- cross-lingual
pretty_name: Luganda Tokenizer Evaluation
viewer: false
---
# Luganda Tokenizer Evaluation
Cross-lingual evaluation of **9 decoder-only LLMs** on English (CoNLL-2003, 5,396 sentences) and Luganda (MasakhaNER, 6,055 sentences), quantifying the "tokenizer tax" that low-resource languages pay when using tokenizers designed for English.
> **Key finding**: Luganda text requires about **70% more tokens per word** than English text on average (fertility ratio 1.70×), which translates directly into higher API costs, more compute, and smaller effective context windows.
## Contents
| File | Description |
|------|-------------|
| `English_Tokenizer_Report.pdf` | Full report with English baselines, Luganda results, cross-lingual comparison, and recommendations |
| `tokenisation_examples.json` | Pre-computed tokenization examples for all models on MasakhaNER Luganda |
| `luganda_eval.py` | Evaluation script for fertility and BPC analysis |
| `tokenization_examples_generator.py` | Script to regenerate tokenization examples |
| `requirements_tokenization.txt` | Python dependencies |
> **Note**: This is a research artifact repository. Use the PDF report for complete results. The JSON file contains raw tokenization data for further analysis.
## The Tokenizer Tax
| Metric | English | Luganda | Ratio (Tax) |
|--------|---------|---------|-------------|
| Average fertility | 1.77 tokens/word | 2.98 tokens/word | **1.70×** |
| Best fertility | 1.55 (gpt2) | 2.58 (gpt-oss-20b) | 1.66× |
| Worst fertility | 1.92 (Qwen) | 3.27 (gpt2) | 1.70× |
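
To make the tax concrete, the sketch below converts the average fertilities from the table above into the number of words that fit in a fixed token budget. The 8,192-token budget and the helper name `words_in_budget` are illustrative assumptions, not values or code from the report.

```python
def words_in_budget(token_budget: int, fertility: float) -> float:
    """Approximate number of whitespace-delimited words that fit in a token budget."""
    return token_budget / fertility


# Average fertilities from the table above (tokens per word).
ENGLISH_FERTILITY = 1.77
LUGANDA_FERTILITY = 2.98

# Hypothetical context window, chosen only for illustration.
budget = 8192

english_words = words_in_budget(budget, ENGLISH_FERTILITY)
luganda_words = words_in_budget(budget, LUGANDA_FERTILITY)

print(f"English: ~{english_words:.0f} words")
print(f"Luganda: ~{luganda_words:.0f} words")
print(f"Luganda fits {luganda_words / english_words:.0%} of the English word count")
```

The same division applies to per-token API pricing: at a fixed price per token, cost per word scales with fertility.
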
## Tokenizer Efficiency Rankings (Luganda)
| Rank | Model | Fertility ↓ | Chars/Token | Single-Char % |
|------|-------|-------------|-------------|---------------|
| 1 | gpt-oss-20b | 2.58 | 2.55 | 16.9% |
| 2 | **ganda-gemma-1b** | **2.85** | 2.31 | 15.8% |
| 3 | gemma-3-1b-it | 2.85 | 2.31 | 15.8% |
| 4 | c4ai-command-r7b | 2.90 | 2.27 | 24.3% |
| 5 | Meta-Llama-3.1-8B | 2.96 | 2.19 | 23.9% |
| 6 | Sunflower-14B | 3.03 | 2.15 | 26.2% |
| 7 | Sunflower-32B | 3.03 | 2.15 | 26.2% |
| 8 | Qwen2.5-1.5B | 3.03 | 2.15 | 26.2% |
| 9 | gpt-neox-20b | 3.17 | 2.06 | 29.7% |
| 10 | gpt2 | 3.27 | 2.00 | 33.9% |
## Language Modeling Quality (Luganda, BPC)
| Rank | Model | BPC ↓ | Token PPL |
|------|-------|-------|-----------|
| 1 | **Sunflower-32B** | **1.45** | 8.7 |
| 2 | Sunflower-14B | 1.48 | 9.1 |
| 3 | **ganda-gemma-1b** | **2.67** | 72.7 |
| 4 | Meta-Llama-3.1-8B | 2.87 | 78.5 |
| 5 | gpt-neox-20b | 3.85 | 242.6 |
| 6 | c4ai-command-r7b | 3.85 | 430.7 |
| 7 | Qwen2.5-1.5B | 3.89 | 328.7 |
| 8 | gpt2 | 4.51 | 528.6 |
| 9 | gemma-3-1b-it | 4.71 | 1917.0 |
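
The two columns in this table are linked by a standard identity: the base-2 log of token-level perplexity is the average number of bits per token, and dividing by the average characters per token gives bits per character. The sketch below applies that identity using the Chars/Token column from the efficiency rankings. It assumes both quantities were measured on the same text. Whether `luganda_eval.py` computes BPC this way is not confirmed here, and the function name is hypothetical.

```python
import math


def bpc_from_token_ppl(token_ppl: float, chars_per_token: float) -> float:
    """Convert token-level perplexity to bits per character.

    log2(token_ppl) is the average number of bits per token; dividing by the
    average number of characters per token yields bits per character.
    """
    return math.log2(token_ppl) / chars_per_token


# model -> (token PPL from this table, chars/token from the efficiency rankings)
rows = {
    "Sunflower-32B": (8.7, 2.15),
    "Meta-Llama-3.1-8B": (78.5, 2.19),
    "gpt2": (528.6, 2.00),
}

for model, (ppl, cpt) in rows.items():
    print(f"{model}: {bpc_from_token_ppl(ppl, cpt):.2f} BPC")
```

Because the inputs are rounded table values, the recomputed figures should match the reported BPC only to within rounding. The identity also explains why a model with a moderate token perplexity can still have a poor BPC when its tokens are short.
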
## Cross-Lingual Fertility Gap
| Model | English | Luganda | Ratio | Assessment |
|-------|---------|---------|-------|------------|
| gpt-oss-20b | 1.67 | 2.58 | 1.55× | ✓ Low tax |
| gemma-3-1b-it | 1.84 | 2.85 | 1.55× | ✓ Low tax |
| c4ai-command-r7b | 1.85 | 2.90 | 1.56× | ✓ Low tax |
| Meta-Llama-3.1-8B | 1.69 | 2.96 | 1.75× | ~ Moderate |
| gpt-neox-20b | 1.58 | 3.17 | 2.01× | ✗ High tax |
| gpt2 | 1.55 | 3.27 | 2.11× | ✗ High tax |
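
The Ratio column is the Luganda fertility divided by the English fertility. A minimal sketch that recomputes it from the rounded values in the table above follows. The dictionary layout and function name are illustrative, and the last digit can differ from the reported ratio because the report presumably divides unrounded fertilities.

```python
# model -> (English fertility, Luganda fertility, reported ratio), from the table above.
FERTILITY = {
    "gpt-oss-20b": (1.67, 2.58, 1.55),
    "gemma-3-1b-it": (1.84, 2.85, 1.55),
    "c4ai-command-r7b": (1.85, 2.90, 1.56),
    "Meta-Llama-3.1-8B": (1.69, 2.96, 1.75),
    "gpt-neox-20b": (1.58, 3.17, 2.01),
    "gpt2": (1.55, 3.27, 2.11),
}


def fertility_ratio(english: float, luganda: float) -> float:
    """Cross-lingual tax: how many times more tokens per word Luganda needs."""
    return luganda / english


for model, (en, lg, reported) in FERTILITY.items():
    ratio = fertility_ratio(en, lg)
    print(f"{model}: {ratio:.2f}x (reported {reported:.2f}x)")
```
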
## Key Insights
1. **The Tokenizer Tax**: Luganda requires 70% more tokens than English on average. This is driven by training data imbalance, Luganda's agglutinative morphology, and vocabulary allocation favoring English subwords.
2. **ganda-gemma-1b**: Competitive tokenizer efficiency (fertility 2.85, tied with gemma-3-1b-it behind gpt-oss-20b) with reasonable BPC (2.67, 3rd best). It is purpose-built for Luganda but does not match Sunflower's language modeling quality.
3. **The Sunflower Advantage**: Sunflower-32B achieves dramatically better Luganda BPC (1.45) by trading English efficiency. Best choice for Luganda language modeling quality.
4. **The GPT-2 Anti-Pattern**: Best for English (fertility 1.55) but worst Luganda fertility (3.27) and second-worst Luganda BPC (4.51, ahead of only gemma-3-1b-it at 4.71). GPT-2's tokenizer is the least efficient choice for Luganda despite being the most efficient for English.
5. **Best by use case**:
   - Lowest Luganda fertility: gpt-oss-20b (2.58)
   - Best Luganda BPC: Sunflower-32B (1.45)
   - Best Luganda-specific: ganda-gemma-1b (balanced efficiency + quality)
   - Lowest cross-lingual tax: gpt-oss-20b (1.55× ratio)
## Methodology
- **Fertility**: Average tokens per whitespace-delimited word (95% bootstrap CIs)
- **BPC**: Bits per character — measures encoding efficiency (lower = better compression)
- **English data**: CoNLL-2003 NER (5,396 sentences) + MorphoLex (68,624 words)
- **Luganda data**: MasakhaNER (6,055 sentences)
- **Models**: 9 decoder-only LLMs evaluated successfully; 4 failed (incompatible formats or auth)
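
The fertility metric and its bootstrap interval can be sketched as follows. This is a minimal illustration, not the code in `luganda_eval.py`. It assumes fertility is the corpus-level ratio of tokens to whitespace-delimited words and that the 95% interval is a percentile bootstrap over sentences. The `chunk_tokenize` function is a toy stand-in for a real tokenizer, and all function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokenize = Callable[[str], List[str]]


def fertility(sentences: Sequence[str], tokenize: Tokenize) -> float:
    """Corpus-level tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words


def bootstrap_ci(
    sentences: Sequence[str],
    tokenize: Tokenize,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for fertility, resampling whole sentences."""
    rng = random.Random(seed)
    # Precompute (token count, word count) per sentence so resampling is cheap.
    counts = [(len(tokenize(s)), len(s.split())) for s in sentences]
    counts = [c for c in counts if c[1] > 0]  # skip empty sentences
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(counts, k=len(counts))
        stats.append(sum(t for t, _ in sample) / sum(w for _, w in sample))
    stats.sort()
    lower = stats[max(0, round(alpha / 2 * n_resamples))]
    upper = stats[min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)]
    return lower, upper


def chunk_tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): split each word into 3-character chunks."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample_sentences = ["webale nnyo", "oli otya"]
point = fertility(sample_sentences, chunk_tokenize)
low, high = bootstrap_ci(sample_sentences, chunk_tokenize)
print(f"fertility = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With Hugging Face `transformers` installed, the `tokenize` method of a tokenizer loaded via `AutoTokenizer.from_pretrained(...)` could be passed in place of `chunk_tokenize`. Resampling sentences rather than individual words keeps each sentence's words together, which is one reasonable choice among several.
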
## Citation
```bibtex
@misc{craneailabs2026tokenizer,
title={Comparative Tokenizer Evaluation for Luganda Language Models: Quantifying the Tokenizer Tax},
author={Bakunga, Bronson and Mubiru, Kato Steven and Tukamushaba, Catherine},
year={2026},
publisher={Crane AI Labs},
url={https://huggingface.co/datasets/CraneAILabs/luganda-tokenizer-evaluation}
}
```
## Acknowledgments
Field research and Luganda linguistic validation conducted by Crane AI Labs. Supported by Fab Inc, funded by the Bill & Melinda Gates Foundation.
Provider: CraneAILabs



