anrilombard/mzansi-text

Name: anrilombard/mzansi-text
Creator: anrilombard
Published: 2026-03-25 03:46:13
License: apache-2.0

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/anrilombard/mzansi-text


Description:

---
language:
- af
- en
- nso
- sot
- ssw
- tsn
- tso
- ven
- xho
- zul
- nbl
tags:
- pretraining
- south-african-languages
- multilingual
- mzansitext
license: apache-2.0
---

# MzansiText

**MzansiText** is a curated multilingual pretraining corpus for all eleven official South African languages.

[![GitHub](https://img.shields.io/badge/GitHub-Anri--Lombard/sallm-blue)](https://github.com/Anri-Lombard/sallm)
[![Paper](https://img.shields.io/badge/Paper-arXiv_2603.20732-red.svg)](https://arxiv.org/abs/2603.20732)
[![Model](https://img.shields.io/badge/Model-MzansiLM_125M-green)](https://huggingface.co/anrilombard/mzansilm-125m)
[![Collection](https://img.shields.io/badge/Collection-MzansiLM-orange)](https://huggingface.co/collections/anrilombard/mzansilm-69635ca7b60efedb9dfcb09e)

## Dataset Details

- Languages: `af`, `en`, `nso`, `sot`, `ssw`, `tsn`, `tso`, `ven`, `xho`, `zul`, `nbl`
- Schema:
  ```json
  {
    "text": "string",
    "lang": "string"
  }
  ```
- This repository contains the raw train, validation, and test text splits used for the MzansiLM pretraining release.
- The token distribution table below matches the paper-reported corpus statistics.
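Because every record carries a `lang` field, a per-language subset can be selected with a simple filter. The following is a minimal sketch that runs on in-memory records shaped like the schema above. The helper names `filter_by_lang` and `count_by_lang` and the sample records are hypothetical, and the sketch assumes that `lang` values use the same codes as the language list (for example `af`, `en`, `zul`).

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Record = Dict[str, str]  # {"text": ..., "lang": ...}, as in the schema above


def filter_by_lang(records: Iterable[Record], lang: str) -> Iterator[Record]:
    """Yield only the records whose `lang` field equals `lang`."""
    for record in records:
        if record["lang"] == lang:
            yield record


def count_by_lang(records: Iterable[Record]) -> Counter:
    """Count how many records each language code has."""
    return Counter(record["lang"] for record in records)


# Hypothetical sample records for illustration; not taken from the corpus.
sample = [
    {"text": "Goeie môre", "lang": "af"},
    {"text": "Good morning", "lang": "en"},
    {"text": "Sawubona", "lang": "zul"},
    {"text": "Hello again", "lang": "en"},
]

english = list(filter_by_lang(sample, "en"))
counts = count_by_lang(sample)
```

With a loaded `datasets.Dataset`, the same selection can likely be expressed as `ds.filter(lambda r: r["lang"] == "zul")`, again assuming the `lang` values match the codes listed above.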
### Token Distribution (after filtering + 65,536-vocab BPE tokenizer)

| Language | Train Tokens | % | Val Tokens | Test Tokens |
|---|---:|---:|---:|---:|
| Afrikaans | 2,475,913,822 | 64.96 | 1,865,255 | 1,875,605 |
| English | 740,994,679 | 19.44 | 1,813,651 | 1,821,803 |
| isiZulu | 320,224,015 | 8.40 | 2,017,406 | 2,021,343 |
| isiXhosa | 152,212,403 | 3.99 | 2,016,503 | 2,012,000 |
| Sesotho | 97,558,939 | 2.56 | 2,315,298 | 2,316,170 |
| Setswana | 10,082,930 | 0.26 | 1,216,539 | 1,413,473 |
| Sepedi | 6,697,358 | 0.18 | 685,425 | 778,656 |
| Xitsonga | 3,013,408 | 0.08 | 510,463 | 319,496 |
| siSwati | 1,932,989 | 0.05 | 196,247 | 225,810 |
| Tshivenda | 1,852,481 | 0.05 | 191,495 | 243,315 |
| isiNdebele | 818,549 | 0.02 | 106,224 | 143,458 |
| **Total** | **3,811,301,573** | **100** | **12,934,506** | **13,171,129** |

Validation and test sets are capped at approximately 2M tokens per language to prevent high-resource languages from dominating early stopping.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("anrilombard/mzansi-text", split="train")
print(ds[0])
```

## Related Releases

- Paper: [arXiv:2603.20732](https://arxiv.org/abs/2603.20732)
- Model: [anrilombard/mzansilm-125m](https://huggingface.co/anrilombard/mzansilm-125m)
- Tokenized corpus: [anrilombard/mzansi-text-tokenized](https://huggingface.co/datasets/anrilombard/mzansi-text-tokenized)
- GitHub code and configs: [https://github.com/Anri-Lombard/sallm](https://github.com/Anri-Lombard/sallm)

Full preprocessing pipeline (including this exact cleaning script) is in [`data/cleaning/`](https://github.com/Anri-Lombard/sallm/tree/main/data/cleaning) on GitHub.
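As a quick cross-check of the Token Distribution table above, the `%` column can be recomputed from the train token counts. This is a small sketch rather than part of the released pipeline: the counts are copied from the table, and the names `TRAIN_TOKENS` and `token_shares` are hypothetical.

```python
# Train token counts copied from the Token Distribution table.
TRAIN_TOKENS = {
    "Afrikaans": 2_475_913_822,
    "English": 740_994_679,
    "isiZulu": 320_224_015,
    "isiXhosa": 152_212_403,
    "Sesotho": 97_558_939,
    "Setswana": 10_082_930,
    "Sepedi": 6_697_358,
    "Xitsonga": 3_013_408,
    "siSwati": 1_932_989,
    "Tshivenda": 1_852_481,
    "isiNdebele": 818_549,
}


def token_shares(counts: dict) -> dict:
    """Return each language's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}


total_train = sum(TRAIN_TOKENS.values())
shares = token_shares(TRAIN_TOKENS)

for lang, pct in shares.items():
    print(f"{lang}: {pct:.2f}%")
```

The recomputed total and shares should agree with the **Total** row and the `%` column of the table.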
## Citation

Please cite the paper:

```bibtex
@misc{lombard2026mzansitextmzansilmopencorpus,
  title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
  author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
  year={2026},
  eprint={2603.20732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.20732},
}
```

## License

Apache License 2.0
