anrilombard/mzansi-text-tokenized
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/anrilombard/mzansi-text-tokenized
---
language:
- af
- en
- nso
- sot
- ssw
- tsn
- tso
- ven
- xho
- zul
- nbl
tags:
- pretraining
- tokenized
- south-african-languages
- multilingual
- mzansitext
license: apache-2.0
---
# MzansiText Tokenized
Ready-to-train tokenized version of **MzansiText**, chunked to a context length of 2048 tokens.
[GitHub](https://github.com/Anri-Lombard/sallm)
[arXiv](https://arxiv.org/abs/2603.20732)
[Model](https://huggingface.co/anrilombard/mzansilm-125m)
[Collection](https://huggingface.co/collections/anrilombard/mzansilm-69635ca7b60efedb9dfcb09e)
## Dataset Details
- Tokenizer: custom BPE with a vocabulary of `65536` tokens
- Chunking: `2048` tokens per example with EOS separators between documents
- Schema:
```json
{
"input_ids": ["int"],
"lang": "string"
}
```
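The chunking described above can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the function name `pack_documents`, the EOS id of `0`, and the choice to drop a trailing partial chunk are all assumptions made for illustration, since the card does not state them.

```python
from typing import Iterable, List

CONTEXT_LENGTH = 2048  # chunk size stated on the card


def pack_documents(docs: Iterable[List[int]], eos_id: int,
                   chunk_len: int = CONTEXT_LENGTH) -> List[List[int]]:
    """Concatenate tokenized documents, adding EOS after each, then cut fixed-size chunks."""
    chunks: List[List[int]] = []
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # separator marking the document boundary
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
    # Assumption: a trailing partial chunk is discarded rather than padded.
    return chunks


# Toy run with a chunk length of 4 and a hypothetical EOS id of 0.
toy_chunks = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, chunk_len=4)
print(toy_chunks)
```

With this kind of packing, a single example can contain the end of one document and the start of the next, separated by the EOS token.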
### Split Sizes
| Split | Examples |
|---|---:|
| Train | 3,943,584 |
| Validation | 19,379 |
| Test | 19,341 |
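As a rough guide to the token budget, the example counts above can be multiplied by the context length. This sketch assumes every example holds exactly 2048 tokens and counts EOS separators as tokens; the split keys are illustrative labels, not confirmed split names.

```python
CONTEXT_LENGTH = 2048  # tokens per example, as stated on the card

# Example counts copied from the split table.
SPLIT_EXAMPLES = {"train": 3_943_584, "validation": 19_379, "test": 19_341}

# Tokens per split = number of examples x tokens per example.
split_tokens = {name: n * CONTEXT_LENGTH for name, n in SPLIT_EXAMPLES.items()}

for name, tokens in split_tokens.items():
    print(f"{name}: {tokens:,} tokens")
```

Under these assumptions the training split comes to about 8.08 billion tokens.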
## Usage
```python
from datasets import load_dataset
ds = load_dataset("anrilombard/mzansi-text-tokenized", split="train")
print(ds[0].keys())
```
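Because each example carries a `lang` field, the language mix can be inspected with a small helper. The sketch below is one possible approach, not part of the release: `count_languages` is a hypothetical name, and the rows are synthetic placeholders in the card's schema so the block runs without downloading anything.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Mapping, Optional


def count_languages(examples: Iterable[Mapping], limit: Optional[int] = None) -> Counter:
    """Count examples per `lang` value, optionally reading only the first `limit` rows."""
    return Counter(ex["lang"] for ex in islice(examples, limit))


# Synthetic rows in the card's schema; the `lang` values are placeholders.
sample = [
    {"input_ids": [1, 2, 3], "lang": "zul"},
    {"input_ids": [4, 5, 6], "lang": "xho"},
    {"input_ids": [7, 8, 9], "lang": "zul"},
]
print(count_languages(sample))
```

The same helper should accept the `ds` object loaded above, since iterating a dataset yields one dictionary per example. For a quick look without fetching the full training split, passing `streaming=True` to `load_dataset` together with a small `limit` is an option.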
## Related Releases
- Paper: [arXiv:2603.20732](https://arxiv.org/abs/2603.20732)
- Model: [anrilombard/mzansilm-125m](https://huggingface.co/anrilombard/mzansilm-125m)
- Raw corpus: [anrilombard/mzansi-text](https://huggingface.co/datasets/anrilombard/mzansi-text)
- GitHub code and configs: [https://github.com/Anri-Lombard/sallm](https://github.com/Anri-Lombard/sallm)
The full preprocessing pipeline, including the cleaning script used for this dataset, is in [`data/cleaning/`](https://github.com/Anri-Lombard/sallm/tree/main/data/cleaning) on GitHub.
## Citation
Please cite the paper:
```bibtex
@misc{lombard2026mzansitextmzansilmopencorpus,
title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
year={2026},
eprint={2603.20732},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.20732},
}
```
## License
Apache License 2.0
Provider: anrilombard



