
anrilombard/mzansi-text-tokenized

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/anrilombard/mzansi-text-tokenized
Resource description:
---
language:
- af
- en
- nso
- sot
- ssw
- tsn
- tso
- ven
- xho
- zul
- nbl
tags:
- pretraining
- tokenized
- south-african-languages
- multilingual
- mzansitext
license: apache-2.0
---

# MzansiText Tokenized

Ready-to-train tokenized version of **MzansiText**, chunked to a context length of 2048 tokens.

[![GitHub](https://img.shields.io/badge/GitHub-Anri--Lombard/sallm-blue)](https://github.com/Anri-Lombard/sallm)
[![Paper](https://img.shields.io/badge/Paper-arXiv_2603.20732-red.svg)](https://arxiv.org/abs/2603.20732)
[![Model](https://img.shields.io/badge/Model-MzansiLM_125M-green)](https://huggingface.co/anrilombard/mzansilm-125m)
[![Collection](https://img.shields.io/badge/Collection-MzansiLM-orange)](https://huggingface.co/collections/anrilombard/mzansilm-69635ca7b60efedb9dfcb09e)

## Dataset Details

- Tokenizer: custom BPE, `65536` vocabulary
- Chunking: `2048` tokens per example with EOS separators between documents
- Schema:

```json
{
  "input_ids": ["int"],
  "lang": "string"
}
```

### Split Sizes

| Split | Examples |
|---|---:|
| Train | 3,943,584 |
| Validation | 19,379 |
| Test | 19,341 |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("anrilombard/mzansi-text-tokenized", split="train")
print(ds[0].keys())
```

## Related Releases

- Paper: [arXiv:2603.20732](https://arxiv.org/abs/2603.20732)
- Model: [anrilombard/mzansilm-125m](https://huggingface.co/anrilombard/mzansilm-125m)
- Raw corpus: [anrilombard/mzansi-text](https://huggingface.co/datasets/anrilombard/mzansi-text)
- GitHub code and configs: [https://github.com/Anri-Lombard/sallm](https://github.com/Anri-Lombard/sallm)

Full preprocessing pipeline (including this exact cleaning script) is in [`data/cleaning/`](https://github.com/Anri-Lombard/sallm/tree/main/data/cleaning) on GitHub.
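
The chunking described under Dataset Details (fixed-length examples with EOS separators between documents) can be illustrated with a minimal sketch. This is not the authors' preprocessing code, which lives in the GitHub repository. The function name `pack_documents`, the EOS id `0`, the choice to append EOS after every document, and the choice to drop a trailing partial block are all assumptions made for illustration; the toy example uses a block size of 4 instead of 2048 so the result is easy to inspect.

```python
from typing import Iterable, Iterator, List


def pack_documents(
    docs: Iterable[List[int]],
    eos_id: int,
    block_size: int = 2048,
) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-size blocks.

    Hypothetical helper: the real pipeline may handle the EOS id, the
    trailing remainder, and per-language grouping differently.
    """
    buffer: List[int] = []
    for doc in docs:
        # Append the document followed by one EOS token as a separator.
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Assumption: a trailing partial block is dropped rather than padded.


# Toy input: three "documents" of token ids, with 0 standing in for EOS.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_documents(docs, eos_id=0, block_size=4))
print(blocks)
```

With packing of this kind, a single example can contain the end of one document and the start of the next, so the EOS token is what marks document boundaries inside `input_ids`.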
## Citation

Please cite the paper:

```bibtex
@misc{lombard2026mzansitextmzansilmopencorpus,
  title={MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages},
  author={Anri Lombard and Simbarashe Mawere and Temi Aina and Ethan Wolff and Sbonelo Gumede and Elan Novick and Francois Meyer and Jan Buys},
  year={2026},
  eprint={2603.20732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.20732},
}
```

## License

Apache License 2.0

Provider:
anrilombard