anrilombard/mzansi-text-tokenized

Name: anrilombard/mzansi-text-tokenized
Creator: anrilombard
Published: 2026-03-25 03:46:15
License: apache-2.0

Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/anrilombard/mzansi-text-tokenized


Dataset description:

---
language:
- af
- en
- nso
- sot
- ssw
- tsn
- tso
- ven
- xho
- zul
- nbl
tags:
- pretraining
- tokenized
- south-african-languages
- multilingual
- mzansitext
license: apache-2.0
---

# MzansiText Tokenized

Ready-to-train tokenized version of **MzansiText**, chunked to a context length of 2048 tokens.

[![GitHub](https://img.shields.io/badge/GitHub-Anri--Lombard/sallm-blue)](https://github.com/Anri-Lombard/sallm)
[![Paper](https://img.shields.io/badge/Paper-arXiv_2603.20732-red.svg)](https://arxiv.org/abs/2603.20732)
[![Model](https://img.shields.io/badge/Model-MzansiLM_125M-green)](https://huggingface.co/anrilombard/mzansilm-125m)
[![Collection](https://img.shields.io/badge/Collection-MzansiLM-orange)](https://huggingface.co/collections/anrilombard/mzansilm-69635ca7b60efedb9dfcb09e)

## Dataset Details

- Tokenizer: custom BPE, `65536` vocabulary
- Chunking: `2048` tokens per example with EOS separators between documents
- Schema:

```json
{
  "input_ids": ["int"],
  "lang": "string"
}
```

### Split Sizes

| Split | Examples |
|---|---:|
| Train | 3,943,584 |
| Validation | 19,379 |
| Test | 19,341 |

## Usage

```python
from datasets import load_dataset

# Load the training split; each example has "input_ids" and "lang".
ds = load_dataset("anrilombard/mzansi-text-tokenized", split="train")
print(ds[0].keys())
```

## Related Releases

- Paper: [arXiv:2603.20732](https://arxiv.org/abs/2603.20732)
- Model: [anrilombard/mzansilm-125m](https://huggingface.co/anrilombard/mzansilm-125m)
- Raw corpus: [anrilombard/mzansi-text](https://huggingface.co/datasets/anrilombard/mzansi-text)
- GitHub code and configs: [https://github.com/Anri-Lombard/sallm](https://github.com/Anri-Lombard/sallm)

The full preprocessing pipeline, including the cleaning script used for this dataset, is in [`data/cleaning/`](https://github.com/Anri-Lombard/sallm/tree/main/data/cleaning) on GitHub.
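
The chunking described under Dataset Details (fixed 2048-token examples with EOS separators between documents) can be illustrated with a minimal sketch. This is not the authors' script, which lives in the GitHub repository. The function name `pack_documents`, the EOS id `0`, and the choice to drop a trailing partial chunk are all assumptions made for illustration. The toy example uses a context length of 4 so the result is easy to follow.

```python
from typing import Iterable, Iterator, List


def pack_documents(
    docs: Iterable[List[int]],
    eos_id: int,
    context_length: int = 2048,
) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-size chunks.

    Hypothetical helper: the real pipeline may pad or keep the final
    partial chunk; here it is simply dropped.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # EOS marks the document boundary
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= context_length:
            yield buffer[:context_length]
            buffer = buffer[context_length:]
    # Any leftover tokens shorter than context_length are discarded (assumption).


# Toy usage: three short "documents", EOS id 0 (assumed), chunks of 4 tokens.
toy_docs = [[5, 6, 7], [8, 9], [10]]
chunks = list(pack_documents(toy_docs, eos_id=0, context_length=4))
print(chunks)
```

With packing of this kind a single example can span several documents. If every training example holds exactly 2048 tokens, the train split comes to 3,943,584 × 2048 = 8,076,460,032 tokens.
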
## Citation

Please cite the paper:

```bibtex
@misc{lombard2026mzansitextmzansilmopencorpus,
  title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
  author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
  year={2026},
  eprint={2603.20732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.20732},
}
```

## License

Apache License 2.0

