LocaleNLP/AfriCorpus-v1
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/LocaleNLP/AfriCorpus-v1
Resource description:
---
license: cc-by-4.0
language:
- wo
- sw
- ha
- yo
- am
- ti
- so
- ig
- zu
- ar
tags:
- african-languages
- nlp
- multilingual
- text-generation
- low-resource
pretty_name: AfriCorpus v1
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- fill-mask
---
# AfriCorpus v1
**AfriCorpus-v1** is the first public release of LocaleNLP's audited, deduplicated, and quality-filtered African language corpus. Built to power the AfriLION LLM project, this dataset targets the **tokenizer fertility** problem: tokenizers trained mostly on high-resource languages split African-language words into many subword pieces, which is a major reason current LLMs underperform on these languages.
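Fertility is commonly measured as the average number of subword tokens produced per whitespace-delimited word. The sketch below computes it for any tokenizer callable. `toy_tokenize` is a hypothetical stand-in (fixed 3-character chunks), not the AfriLION tokenizer, and `fertility` is an illustrative helper name.

```python
from typing import Callable, Iterable


def fertility(texts: Iterable[str], tokenize: Callable[[str], list[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    return n_tokens / n_words if n_words else 0.0


def toy_tokenize(word: str) -> list[str]:
    """Hypothetical tokenizer: splits a word into 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


print(fertility(["Nanga def, baal ma."], toy_tokenize))  # 1.75
```

A lower value means fewer pieces per word. Swapping `toy_tokenize` for a real tokenizer's encode function gives a per-language comparison.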
## Key Statistics
| Language | Code | Script | Source | Status |
|----------|------|--------|---------------|--------|
| Wolof | `wo` | Latin | CC-100 | Audited |
| Swahili | `sw` | Latin | CC-100 | Audited |
| Hausa | `ha` | Latin + Ajami | CC-100 | Audited |
| Yoruba | `yo` | Latin | CC-100 | Audited |
| Amharic | `am` | Ge'ez (Ethiopic) | CC-100 | Audited |
| Tigrinya | `ti` | Ge'ez (Ethiopic) | CC-100 | In Progress |
| Somali | `so` | Latin | CC-100 | In Progress |
| Igbo | `ig` | Latin | CC-100 | In Progress |
| Zulu | `zu` | Latin | CC-100 | In Progress |
## Quality Assurance Pipeline
Every document in this corpus has passed through a 7-stage pipeline:
1. **Download** — CC-100 `.txt.xz` source files from StatMT.
2. **Language-ID Filter** — `langdetect` with confidence threshold > 0.90.
3. **Text Cleaning** — URL removal, HTML stripping, control character normalization.
4. **Deduplication** — MinHash LSH (threshold 0.85, 128 permutations), including cross-lingual dedup.
5. **Length Filter** — Only sentences with 20–2048 whitespace tokens are kept.
6. **JSONL Sharding** — 100k lines per shard for streaming compatibility.
7. **Upload** — Published here with provenance metadata on every record.
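Stages 3 to 5 can be sketched with the standard library alone. This is an illustration, not the project's pipeline code: the regexes, the 5-character shingle size, and all helper names are assumptions, and signatures are compared pairwise instead of through an LSH index (a library such as `datasketch` provides that).

```python
import hashlib
import html
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(text: str) -> str:
    """Stage 3: strip HTML tags and URLs, replace control characters, collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", text))
    text = URL_RE.sub(" ", text)
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    return " ".join(text.split())


def passes_length_filter(text: str, lo: int = 20, hi: int = 2048) -> bool:
    """Stage 5: keep texts with 20 to 2048 whitespace tokens."""
    return lo <= len(text.split()) <= hi


def minhash_signature(text: str, num_perm: int = 128, k: int = 5) -> list[int]:
    """Stage 4 (simplified): MinHash signature over character k-grams."""
    grams = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag a pair whose estimated similarity reaches the dedup threshold."""
    return estimated_jaccard(minhash_signature(a), minhash_signature(b)) >= threshold
```

Pairwise comparison is quadratic in the number of documents, which is why a corpus-scale pipeline buckets signatures with LSH instead.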
## Critical Design Decisions
### Ge'ez Script Handling
Amharic and Tigrinya use the Ge'ez (Ethiopic) script, an abugida with ~500 syllabic characters in which each consonant-vowel combination is its own glyph. Many of these glyphs are rare in any given corpus, so training a SentencePiece tokenizer on this corpus requires `character_coverage=0.9999`. **Do not lower this value**, or your tokenizer will emit byte-fallback tokens such as `<0xE1><0x88><0xA0>` instead of actual Ge'ez glyphs, silently corrupting Amharic and Tigrinya model training.
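To see why the coverage value matters, the sketch below approximates how a character-coverage cutoff keeps the most frequent characters and drops the rare tail, then shows what a dropped Ge'ez glyph looks like as byte-fallback pieces. `covered_chars` is a simplified model of the cutoff, not SentencePiece's implementation, and the character counts are made up for illustration.

```python
from collections import Counter


def covered_chars(counts: Counter, coverage: float) -> set[str]:
    """Keep the most frequent characters until `coverage` of all occurrences is reached."""
    total = sum(counts.values())
    kept: set[str] = set()
    running = 0
    for ch, n in counts.most_common():
        if running / total >= coverage:
            break
        kept.add(ch)
        running += n
    return kept


def byte_fallback_pieces(ch: str) -> list[str]:
    """Render a character as the per-byte pieces a byte-fallback tokenizer would emit."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


# Hypothetical counts: two common characters and one rare Ge'ez glyph (U+1220).
counts = Counter({"a": 9000, "b": 990, "\u1220": 10})

print("\u1220" in covered_chars(counts, 0.9999))  # True: the glyph stays in the vocabulary
print("\u1220" in covered_chars(counts, 0.995))   # False: the glyph falls outside the cutoff
print(byte_fallback_pieces("\u1220"))             # ['<0xE1>', '<0x88>', '<0xA0>']
```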
### Equal Upsampling
Wolof has ~40MB of CC-100 data; Swahili has ~6.6GB. A proportionally weighted tokenizer devotes most of its vocabulary budget to Swahili, leaving Wolof with ~200 tokens that fragment every word into 5–6 pieces. Our tokenizer training script upsamples Wolof **150x** so that it is represented on roughly the same scale as Swahili.
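As a rough check on the arithmetic, the factor needed to match the largest language is the ratio of corpus sizes. With the approximate sizes quoted above, the ratio comes out near 165, the same order as the 150x used. How the exact 150x figure was chosen is not stated here, so treat `upsample_factors` as a hypothetical helper, not the training script's logic.

```python
def upsample_factors(sizes_mb: dict[str, float]) -> dict[str, float]:
    """Factor by which each language must be repeated to match the largest one."""
    largest = max(sizes_mb.values())
    return {lang: largest / size for lang, size in sizes_mb.items()}


# Approximate CC-100 sizes quoted above, in MB.
sizes_mb = {"wo": 40, "sw": 6600}

print(upsample_factors(sizes_mb))  # {'wo': 165.0, 'sw': 1.0}
print(sizes_mb["wo"] * 150)        # 6000 (MB of Wolof after 150x upsampling)
```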
### Lang ID Tokens
A language ID token (`[WO]`, `[SW]`, `[HA]`, `[AM]`, etc.) is prepended to every document during tokenizer training. This enables the model to condition on language at inference time, which is critical for code-switching and per-language perplexity measurement.
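A minimal sketch of the tagging step follows. The single space between token and text and the helper name `tag_document` are assumptions. Reserving the tokens so the tokenizer never splits them (for example through SentencePiece's `user_defined_symbols` training option) is one common approach, not necessarily the one used here.

```python
def tag_document(text: str, lang: str) -> str:
    """Prepend a language ID token such as [WO] to a document."""
    return f"[{lang.upper()}] {text}"


# Tokens for the four codes shown above; these would be reserved as whole symbols.
LANG_TOKENS = [f"[{code.upper()}]" for code in ["wo", "sw", "ha", "am"]]

print(LANG_TOKENS)                                # ['[WO]', '[SW]', '[HA]', '[AM]']
print(tag_document("Nanga def, baal ma.", "wo"))  # [WO] Nanga def, baal ma.
```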
## Usage
```python
from datasets import load_dataset

# Load a specific language
ds = load_dataset("LocaleNLP/AfriCorpus-v1", split="wo")
print(ds[0])
# {'text': 'Nanga def, baal ma.', 'lang': 'wo', 'lang_name': 'Wolof',
#  'token_count': 5, 'source': 'cc100'}

# Load all languages
ds_all = load_dataset("LocaleNLP/AfriCorpus-v1")
```
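Each record carries `lang` and `token_count` fields (per the example record above), so per-language totals can be gathered in one pass over any iterable of records. `tokens_per_language` is a hypothetical helper, and the sample list simply reuses the record shown above.

```python
from collections import defaultdict
from typing import Iterable


def tokens_per_language(records: Iterable[dict]) -> dict[str, int]:
    """Sum the `token_count` field per `lang` over an iterable of records."""
    totals: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["lang"]] += rec["token_count"]
    return dict(totals)


# Works on any iterable of record dicts, such as a loaded split.
sample = [
    {"text": "Nanga def, baal ma.", "lang": "wo", "lang_name": "Wolof",
     "token_count": 5, "source": "cc100"},
]
print(tokens_per_language(sample))  # {'wo': 5}
```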
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{africorpus_v1_2026,
title = {AfriCorpus v1: Audited African Language Corpus for LLM Training},
author = {Jagne, Alieu and LocaleNLP Team},
year = {2026},
url = {https://huggingface.co/datasets/LocaleNLP/AfriCorpus-v1},
license = {cc-by-4.0}
}
```
## Related Resources
- **GitHub:** [LocaleNLP/afrilion](https://github.com/LocaleNLP/afrilion)
- **Model:** [LocaleNLP/afrilion-base](https://huggingface.co/LocaleNLP/afrilion-base)
- **Community:** [Masakhane](https://github.com/masakhane-io/masakhane)
Provider:
LocaleNLP



