CraneAILabs/luganda-tokenizer-evaluation

Name: CraneAILabs/luganda-tokenizer-evaluation
Creator: CraneAILabs
Published: 2026-04-08 16:36:22
License: apache-2.0

Source: Hugging Face. Updated: 2026-04-08. Indexed: 2026-05-10.

Download link:

https://hf-mirror.com/datasets/CraneAILabs/luganda-tokenizer-evaluation


Description:

---
license: apache-2.0
language:
- lug
- en
tags:
- luganda
- tokenizer
- evaluation
- low-resource
- african-languages
- morphology
- bpc
- fertility
- cross-lingual
pretty_name: Luganda Tokenizer Evaluation
viewer: false
---

# Luganda Tokenizer Evaluation

Cross-lingual evaluation of **9 decoder-only LLMs** on English (CoNLL-2003, 5,396 sentences) and Luganda (MasakhaNER, 6,055 sentences), quantifying the "tokenizer tax" that low-resource languages pay when using tokenizers designed for English.

> **Key finding**: Luganda text requires **70% more tokens** than equivalent English text on average (fertility ratio 1.70×), translating directly to higher API costs, more compute, and reduced effective context windows.

## Contents

| File | Description |
|------|-------------|
| `English_Tokenizer_Report.pdf` | Full report with English baselines, Luganda results, cross-lingual comparison, and recommendations |
| `tokenisation_examples.json` | Pre-computed tokenization examples for all models on MasakhaNER Luganda |
| `luganda_eval.py` | Evaluation script for fertility and BPC analysis |
| `tokenization_examples_generator.py` | Script to regenerate tokenization examples |
| `requirements_tokenization.txt` | Python dependencies |

> **Note**: This is a research artifact repository. Use the PDF report for complete results. The JSON file contains raw tokenization data for further analysis.
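
## Worked Example: Computing Fertility

The fertility figures and the 1.70× ratio quoted in the key finding rest on a simple count: tokens divided by whitespace-delimited words. The sketch below is a minimal illustration and not the repository's `luganda_eval.py`. `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, the two sentences are toy inputs rather than CoNLL-2003 or MasakhaNER data, and both the corpus-level pooling of counts and the percentile bootstrap are assumptions about how the average and its 95% confidence interval might be computed.

```python
import random


def toy_tokenize(text, piece_len=3):
    """Hypothetical tokenizer: cut every word into chunks of `piece_len` characters."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


def fertility(sentences, tokenize):
    """Tokens per whitespace-delimited word, pooled over all sentences (assumed)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words


def bootstrap_ci(sentences, tokenize, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling sentences with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        fertility(rng.choices(sentences, k=len(sentences)), tokenize)
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy inputs for illustration only.
english = ["the cat sat on the mat"]
luganda = ["abantu bagenda mu kibuga"]

f_en = fertility(english, toy_tokenize)   # 6 tokens / 6 words
f_lug = fertility(luganda, toy_tokenize)  # 8 tokens / 4 words
tax = f_lug / f_en                        # cross-lingual fertility ratio
print(f_en, f_lug, tax)  # prints: 1.0 2.0 2.0
```

To try this on a real model, one would presumably swap `toy_tokenize` for the model's own tokenizer and pass in the full sentence lists; the counting logic stays the same.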

## The Tokenizer Tax

| Metric | English | Luganda | Ratio (Tax) |
|--------|---------|---------|-------------|
| Average fertility | 1.77 tokens/word | 2.98 tokens/word | **1.70×** |
| Best fertility | 1.55 (gpt2) | 2.58 (gpt-oss-20b) | 1.66× |
| Worst fertility | 1.92 (Qwen) | 3.27 (gpt2) | 1.70× |

## Tokenizer Efficiency Rankings (Luganda)

| Rank | Model | Fertility ↓ | Chars/Token | Single-Char % |
|------|-------|-------------|-------------|---------------|
| 1 | gpt-oss-20b | 2.58 | 2.55 | 16.9% |
| 2 | **ganda-gemma-1b** | **2.85** | 2.31 | 15.8% |
| 3 | gemma-3-1b-it | 2.85 | 2.31 | 15.8% |
| 4 | c4ai-command-r7b | 2.90 | 2.27 | 24.3% |
| 5 | Meta-Llama-3.1-8B | 2.96 | 2.19 | 23.9% |
| 6 | Sunflower-14B | 3.03 | 2.15 | 26.2% |
| 7 | Sunflower-32B | 3.03 | 2.15 | 26.2% |
| 8 | Qwen2.5-1.5B | 3.03 | 2.15 | 26.2% |
| 9 | gpt-neox-20b | 3.17 | 2.06 | 29.7% |
| 10 | gpt2 | 3.27 | 2.00 | 33.9% |

## Language Modeling Quality (Luganda, BPC)

| Rank | Model | BPC ↓ | Token PPL |
|------|-------|-------|-----------|
| 1 | **Sunflower-32B** | **1.45** | 8.7 |
| 2 | Sunflower-14B | 1.48 | 9.1 |
| 3 | **ganda-gemma-1b** | **2.67** | 72.7 |
| 4 | Meta-Llama-3.1-8B | 2.87 | 78.5 |
| 5 | gpt-neox-20b | 3.85 | 242.6 |
| 6 | c4ai-command-r7b | 3.85 | 430.7 |
| 7 | Qwen2.5-1.5B | 3.89 | 328.7 |
| 8 | gpt2 | 4.51 | 528.6 |
| 9 | gemma-3-1b-it | 4.71 | 1917.0 |

## Cross-Lingual Fertility Gap

| Model | English | Luganda | Ratio | Assessment |
|-------|---------|---------|-------|------------|
| gpt-oss-20b | 1.67 | 2.58 | 1.55× | ✓ Low tax |
| gemma-3-1b-it | 1.84 | 2.85 | 1.55× | ✓ Low tax |
| c4ai-command-r7b | 1.85 | 2.90 | 1.56× | ✓ Low tax |
| Meta-Llama-3.1-8B | 1.69 | 2.96 | 1.75× | ~ Moderate |
| gpt-neox-20b | 1.58 | 3.17 | 2.01× | ✗ High tax |
| gpt2 | 1.55 | 3.27 | 2.11× | ✗ High tax |

## Key Insights

1. **The Tokenizer Tax**: Luganda requires 70% more tokens than English on average.
   This is driven by training data imbalance, Luganda's agglutinative morphology, and vocabulary allocation favoring English subwords.
2. **ganda-gemma-1b**: Competitive tokenizer efficiency (fertility 2.85, tied with gemma-3-1b-it behind gpt-oss-20b) with reasonable BPC (2.67, 3rd best). Purpose-built for Luganda, but it does not match Sunflower's language modeling quality.
3. **The Sunflower Advantage**: Sunflower-32B achieves dramatically better Luganda BPC (1.45) by trading English efficiency. Best choice for Luganda language modeling quality.
4. **The GPT-2 Anti-Pattern**: Best for English (fertility 1.55), worst for Luganda (fertility 3.27, BPC 4.51). GPT-2's tokenizer is the worst choice for Luganda despite being excellent for English.
5. **Best by use case**:
   - Lowest Luganda fertility: gpt-oss-20b (2.58)
   - Best Luganda BPC: Sunflower-32B (1.45)
   - Best Luganda-specific: ganda-gemma-1b (balanced efficiency + quality)
   - Lowest cross-lingual tax: gpt-oss-20b (1.55× ratio)

## Methodology

- **Fertility**: Average tokens per whitespace-delimited word (95% bootstrap CIs)
- **BPC**: Bits per character, a measure of encoding efficiency (lower = better compression)
- **English data**: CoNLL-2003 NER (5,396 sentences) + MorphoLex (68,624 words)
- **Luganda data**: MasakhaNER (6,055 sentences)
- **Models**: 9 decoder-only LLMs evaluated successfully; 4 failed (incompatible formats or auth)

## Citation

```bibtex
@misc{craneailabs2026tokenizer,
  title={Comparative Tokenizer Evaluation for Luganda Language Models: Quantifying the Tokenizer Tax},
  author={Bakunga, Bronson and Mubiru, Kato Steven and Tukamushaba, Catherine},
  year={2026},
  publisher={Crane AI Labs},
  url={https://huggingface.co/datasets/CraneAILabs/luganda-tokenizer-evaluation}
}
```

## Acknowledgments

Field research and Luganda linguistic validation conducted by Crane AI Labs. Supported by Fab Inc, funded by the Bill & Melinda Gates Foundation.
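
## Worked Example: BPC from Token Perplexity

Bits per character and token-level perplexity are linked by a standard identity: a model spends log2(PPL) bits per token on average, so dividing by the average number of characters per token gives bits per character. The sketch below applies that identity to a few rows of the tables in this card. It assumes the Chars/Token column of the fertility ranking describes the same Luganda text on which Token PPL was measured. It is not taken from `luganda_eval.py`, and the function name `bpc_from_token_ppl` is illustrative.

```python
import math


def bpc_from_token_ppl(token_ppl, chars_per_token):
    """Bits per character implied by a token-level perplexity."""
    bits_per_token = math.log2(token_ppl)
    return bits_per_token / chars_per_token


# (Token PPL, Chars/Token, reported BPC), copied from the tables in this card.
reported = {
    "Sunflower-32B": (8.7, 2.15, 1.45),
    "ganda-gemma-1b": (72.7, 2.31, 2.67),
    "Meta-Llama-3.1-8B": (78.5, 2.19, 2.87),
    "gpt2": (528.6, 2.00, 4.51),
}

for model, (ppl, cpt, bpc) in reported.items():
    derived = bpc_from_token_ppl(ppl, cpt)
    # Inputs are rounded in the tables, so small gaps are expected.
    print(f"{model}: derived {derived:.2f}, reported {bpc:.2f}")
```

For these rows the derived values land within about 0.02 of the reported BPC, which is consistent with rounding in the published inputs. The identity also explains why a tokenizer with more characters per token can reach a lower BPC at the same token perplexity.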
