anrilombard/mzansi-text
Source: Hugging Face | Updated: 2026-03-25 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/anrilombard/mzansi-text
---
language:
- af
- en
- nso
- sot
- ssw
- tsn
- tso
- ven
- xho
- zul
- nbl
tags:
- pretraining
- south-african-languages
- multilingual
- mzansitext
license: apache-2.0
---
# MzansiText
**MzansiText** is a curated multilingual pretraining corpus for all eleven official South African languages.
[GitHub](https://github.com/Anri-Lombard/sallm) | [Paper](https://arxiv.org/abs/2603.20732) | [Model](https://huggingface.co/anrilombard/mzansilm-125m) | [Collection](https://huggingface.co/collections/anrilombard/mzansilm-69635ca7b60efedb9dfcb09e)
## Dataset Details
- Languages: `af`, `en`, `nso`, `sot`, `ssw`, `tsn`, `tso`, `ven`, `xho`, `zul`, `nbl`
- Schema:
```json
{
"text": "string",
"lang": "string"
}
```
- This repository contains the raw train, validation, and test text splits used for the MzansiLM pretraining release.
- The token distribution table below matches the paper-reported corpus statistics.
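
As a quick sanity check on the schema above, a minimal sketch of a record validator follows. The helper name `is_valid_record` is hypothetical, and it assumes that the `lang` field holds the same codes as the card's language list, which the card does not state explicitly.

```python
from typing import Any, Mapping

# Language codes listed on this card. Assumption: the "lang" field
# uses exactly these codes; adjust if the data uses other labels.
LANGS = frozenset(
    {"af", "en", "nso", "sot", "ssw", "tsn", "tso", "ven", "xho", "zul", "nbl"}
)


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has string "text" and "lang" fields
    and "lang" is one of the eleven listed codes."""
    text = record.get("text")
    lang = record.get("lang")
    return isinstance(text, str) and isinstance(lang, str) and lang in LANGS
```

A check like this could be run over a small sample of each split before training to catch malformed rows early.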
### Token Distribution (after filtering + 65,536-vocab BPE tokenizer)
| Language | Train Tokens | Train % | Val Tokens | Test Tokens |
|---|---:|---:|---:|---:|
| Afrikaans | 2,475,913,822 | 64.96 | 1,865,255 | 1,875,605 |
| English | 740,994,679 | 19.44 | 1,813,651 | 1,821,803 |
| isiZulu | 320,224,015 | 8.40 | 2,017,406 | 2,021,343 |
| isiXhosa | 152,212,403 | 3.99 | 2,016,503 | 2,012,000 |
| Sesotho | 97,558,939 | 2.56 | 2,315,298 | 2,316,170 |
| Setswana | 10,082,930 | 0.26 | 1,216,539 | 1,413,473 |
| Sepedi | 6,697,358 | 0.18 | 685,425 | 778,656 |
| Xitsonga | 3,013,408 | 0.08 | 510,463 | 319,496 |
| siSwati | 1,932,989 | 0.05 | 196,247 | 225,810 |
| Tshivenda | 1,852,481 | 0.05 | 191,495 | 243,315 |
| isiNdebele | 818,549 | 0.02 | 106,224 | 143,458 |
| **Total** | **3,811,301,573** | **100** | **12,934,506** | **13,171,129** |
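
The percentage column appears to be each language's share of the train-token total. The short sketch below recomputes those shares from the train counts in the table; the variable names are illustrative only.

```python
# Train token counts copied from the table above.
TRAIN_TOKENS = {
    "Afrikaans": 2_475_913_822,
    "English": 740_994_679,
    "isiZulu": 320_224_015,
    "isiXhosa": 152_212_403,
    "Sesotho": 97_558_939,
    "Setswana": 10_082_930,
    "Sepedi": 6_697_358,
    "Xitsonga": 3_013_408,
    "siSwati": 1_932_989,
    "Tshivenda": 1_852_481,
    "isiNdebele": 818_549,
}

# Total train tokens across all eleven languages.
total = sum(TRAIN_TOKENS.values())

# Share of the train total per language, in percent.
shares = {name: 100 * count / total for name, count in TRAIN_TOKENS.items()}

for name, share in shares.items():
    print(f"{name:<11} {share:6.2f}%")
```

Because each share is rounded to two decimals independently, the rounded values need not sum to exactly 100.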
Validation and test sets are capped at approximately 2M tokens per language to prevent high-resource languages from dominating early stopping.
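
One way such a per-language cap could be applied is sketched below. This is not the authors' procedure, which the card does not describe: the function `cap_per_language` is hypothetical, and the default whitespace token counter is a stand-in for the 65,536-vocab BPE tokenizer.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator


def cap_per_language(
    records: Iterable[Dict[str, Any]],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Iterator[Dict[str, Any]]:
    """Yield records until each language has used up its token budget.

    The last record kept for a language may overshoot the budget,
    so the cap is approximate rather than exact.
    """
    used: Dict[str, int] = defaultdict(int)
    for record in records:
        lang = record["lang"]
        if used[lang] >= budget:
            continue  # this language has already reached its cap
        used[lang] += count_tokens(record["text"])
        yield record
```

Low-resource languages that never reach the budget pass through unchanged, which is consistent with the smaller validation and test counts in the table.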
## Usage
```python
from datasets import load_dataset

# Load the train split; each record has "text" and "lang" fields.
ds = load_dataset("anrilombard/mzansi-text", split="train")
print(ds[0])
```
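
To work with a single language, records can be filtered on the `lang` field. The sketch below uses a hypothetical helper, `iter_language`, and made-up placeholder records; it assumes `lang` holds codes such as `zul`, which should be checked against the actual data.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_language(
    records: Iterable[Dict[str, Any]], lang: str
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose "lang" field equals `lang`."""
    for record in records:
        if record.get("lang") == lang:
            yield record


# Placeholder records shaped like the dataset schema (not real corpus rows).
sample = [
    {"text": "Hallo", "lang": "af"},
    {"text": "Sawubona", "lang": "zul"},
]
zulu_only = list(iter_language(sample, "zul"))
print(zulu_only)
```

The same helper accepts any iterable of records, so a dataset opened with `load_dataset("anrilombard/mzansi-text", split="train", streaming=True)` could be passed in place of `sample` to avoid downloading the full split first.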
## Related Releases
- Paper: [arXiv:2603.20732](https://arxiv.org/abs/2603.20732)
- Model: [anrilombard/mzansilm-125m](https://huggingface.co/anrilombard/mzansilm-125m)
- Tokenized corpus: [anrilombard/mzansi-text-tokenized](https://huggingface.co/datasets/anrilombard/mzansi-text-tokenized)
- GitHub code and configs: [https://github.com/Anri-Lombard/sallm](https://github.com/Anri-Lombard/sallm)

The full preprocessing pipeline, including the exact cleaning script used for this corpus, is in [`data/cleaning/`](https://github.com/Anri-Lombard/sallm/tree/main/data/cleaning) on GitHub.
## Citation
Please cite the paper:
```bibtex
@misc{lombard2026mzansitextmzansilmopencorpus,
title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
year={2026},
eprint={2603.20732},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.20732},
}
```
## License
Apache License 2.0



