
LocaleNLP/AfriCorpus-v1

Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/LocaleNLP/AfriCorpus-v1
Resource overview:
---
license: cc-by-4.0
language:
- wo
- sw
- ha
- yo
- am
- ti
- so
- ig
- zu
- ar
tags:
- african-languages
- nlp
- multilingual
- text-generation
- low-resource
pretty_name: AfriCorpus v1
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- fill-mask
---

# AfriCorpus v1

**AfriCorpus-v1** is the first public release of LocaleNLP's audited, deduplicated, and quality-filtered African language corpus. Built to power the AfriLION LLM project, this dataset directly addresses the **Tokenizer Fertility** problem that causes all current LLMs to underperform on African languages.

## Key Statistics

| Language | Code | Script | CC-100 Source | Status |
|----------|------|--------|---------------|--------|
| Wolof | `wo` | Latin | CC-100 | Audited |
| Swahili | `sw` | Latin | CC-100 | Audited |
| Hausa | `ha` | Latin + Ajami | CC-100 | Audited |
| Yoruba | `yo` | Latin | CC-100 | Audited |
| Amharic | `am` | Ge'ez (Ethiopic) | CC-100 | Audited |
| Tigrinya | `ti` | Ge'ez (Ethiopic) | CC-100 | In Progress |
| Somali | `so` | Latin | CC-100 | In Progress |
| Igbo | `ig` | Latin | CC-100 | In Progress |
| Zulu | `zu` | Latin | CC-100 | In Progress |

## Quality Assurance Pipeline

Every document in this corpus has passed through a 7-stage pipeline:

1. **Download**: CC-100 `.txt.xz` source files from StatMT.
2. **Language-ID Filter**: `langdetect` with confidence threshold > 0.90.
3. **Text Cleaning**: URL removal, HTML stripping, control character normalization.
4. **Deduplication**: MinHash LSH (threshold 0.85, 128 permutations), including cross-lingual dedup.
5. **Length Filter**: Only sentences with 20–2048 whitespace tokens are kept.
6. **JSONL Sharding**: 100k lines per shard for streaming compatibility.
7. **Upload**: Published here with provenance metadata on every record.

## Critical Design Decisions

### Ge'ez Script Handling

Amharic and Tigrinya use the Ge'ez (Ethiopic) script, which has ~500 base syllabic characters. Each consonant-vowel combination is a unique glyph, leading to thousands of distinct characters. Training on this corpus requires `character_coverage=0.9999` in SentencePiece. **Do not lower this value**, or your tokenizer will produce `<0xE1><0x88><0xA0>` byte-fallback tokens instead of actual Ge'ez glyphs, silently corrupting Amharic model training.

### Equal Upsampling

Wolof has ~40MB of CC-100 data; Swahili has ~6.6GB. A proportionally weighted tokenizer devotes most of its vocab budget to Swahili, leaving Wolof with ~200 tokens that fragment every word into 5–6 pieces. Our tokenizer training script upsamples Wolof **150x** to achieve equal representation.

### Lang ID Tokens

Every document is prepended with a language ID token (`[WO]`, `[SW]`, `[HA]`, `[AM]`, etc.) during tokenizer training. This enables the model to condition on language at inference time, which is critical for code-switching and per-language perplexity measurement.

## Usage

```python
from datasets import load_dataset

# Load a specific language
ds = load_dataset("LocaleNLP/AfriCorpus-v1", split="wo")
print(ds[0])
# {'text': 'Nanga def, baal ma.', 'lang': 'wo', 'lang_name': 'Wolof',
#  'token_count': 5, 'source': 'cc100'}

# Load all languages
ds_all = load_dataset("LocaleNLP/AfriCorpus-v1")
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{africorpus_v1_2026,
  title   = {AfriCorpus v1: Audited African Language Corpus for LLM Training},
  author  = {Jagne, Alieu and LocaleNLP Team},
  year    = {2026},
  url     = {https://huggingface.co/datasets/LocaleNLP/AfriCorpus-v1},
  license = {cc-by-4.0}
}
```

## Related Resources

- **GitHub:** [LocaleNLP/afrilion](https://github.com/LocaleNLP/afrilion)
- **Model:** [LocaleNLP/afrilion-base](https://huggingface.co/LocaleNLP/afrilion-base)
- **Community:** [Masakhane](https://github.com/masakhane-io/masakhane)
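
For readers who want a feel for stages 3 and 5 of the Quality Assurance Pipeline, the following is a minimal standard-library sketch. It is not the project's actual code: the function names `clean_text` and `passes_length_filter`, the regular expressions, and the choice to replace control characters with spaces are all assumptions made for illustration. Only the 20–2048 whitespace-token bounds come from the dataset card.

```python
import html
import re
import unicodedata

# Hypothetical patterns; the real pipeline's rules are not published in this card.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(text: str) -> str:
    """Stage 3 sketch: drop URLs and HTML tags, normalize control characters."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # Replace control characters (Unicode category Cc) with spaces.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    # Collapse the whitespace runs left behind by the removals.
    return " ".join(text.split())


def passes_length_filter(text: str, min_tokens: int = 20, max_tokens: int = 2048) -> bool:
    """Stage 5 sketch: keep texts with 20 to 2048 whitespace-separated tokens."""
    return min_tokens <= len(text.split()) <= max_tokens


raw = "<p>Habari  za\tleo</p> https://example.com/ukurasa"
print(clean_text(raw))                        # Habari za leo
print(passes_length_filter(clean_text(raw)))  # False (only 3 tokens)
```

Counting tokens with `str.split()` matches the card's "whitespace tokens" wording; a subword tokenizer would give different counts.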
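
To make the byte-fallback warning in the Ge'ez Script Handling section concrete, the sketch below shows where a sequence like `<0xE1><0x88><0xA0>` comes from: it is simply the UTF-8 encoding of a single Ethiopic glyph, spelled one byte per token. The helper `byte_fallback_tokens` is a hypothetical illustration of that spelling, not a SentencePiece API.

```python
import unicodedata


def byte_fallback_tokens(ch: str) -> list[str]:
    """Spell a character as one <0xNN> piece per UTF-8 byte (illustrative helper)."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


glyph = "\u1220"  # a single Ge'ez (Ethiopic) syllable
print(unicodedata.name(glyph))
print(byte_fallback_tokens(glyph))  # ['<0xE1>', '<0x88>', '<0xA0>']
```

One glyph becomes three tokens, so a tokenizer that falls back to bytes roughly triples sequence length for Ge'ez text and never learns the syllables as units.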
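
The Equal Upsampling and Lang ID Tokens decisions can be sketched in a few lines. This is an illustration under stated assumptions, not the project's training script: the names `size_ratio`, `upsample`, and `add_lang_token` are hypothetical, sizes are taken as decimal megabytes, and the single space between the language token and the text is a guess at the format.

```python
def size_ratio(small_mb: float, large_mb: float) -> float:
    """How many times larger the big corpus is than the small one."""
    return large_mb / small_mb


def upsample(records: list[str], factor: int) -> list[str]:
    """Repeat every record `factor` times (naive in-memory upsampling)."""
    return [r for r in records for _ in range(factor)]


def add_lang_token(text: str, lang: str) -> str:
    """Prepend a language ID token such as [WO]; the separator is an assumption."""
    return f"[{lang.upper()}] {text}"


# Sizes quoted in the card: Wolof ~40MB, Swahili ~6.6GB (taken here as 6600 MB).
print(round(size_ratio(40, 6600)))  # 165

wolof = [add_lang_token("Nanga def, baal ma.", "wo")]
print(upsample(wolof, 150)[0])      # [WO] Nanga def, baal ma.
print(len(upsample(wolof, 150)))    # 150
```

The raw size ratio from the quoted figures is about 165, so the 150x factor stated in the card brings Wolof close to, though slightly under, byte-level parity with Swahili.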
Provider:
LocaleNLP