DanMM96/Tshiluba-dataset

Name: DanMM96/Tshiluba-dataset
Creator: DanMM96
Published: 2025-11-04 21:51:35
License: No description available

Hugging Face | Updated 2025-11-04 | Indexed 2025-11-15

Download link:

https://hf-mirror.com/datasets/DanMM96/Tshiluba-dataset

Resource overview:

The Tshiluba Text Corpus is a monolingual corpus for Tshiluba, a Bantu language spoken by more than 7 million people in the Democratic Republic of the Congo. Tshiluba is a low-resource language, and very few publicly available Natural Language Processing (NLP) datasets exist for it. The corpus is compiled from publicly available text sources and is intended to support language modeling, text classification, and other NLP tasks. It currently consists of training, validation, and test sets in JSONL format, with each record containing a sentence and its ID. Most of the data comes from public-domain Bible texts and song lyrics, so the corpus carries some domain bias. It is still a work in progress, and the planned next step is to add a parallel Tshiluba-English corpus.
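
The JSONL layout described above (one JSON object per line, each holding a sentence and its ID) can be read with the Python standard library alone. The following is a minimal sketch, not the dataset author's loader: the key names `id` and `sentence` are assumptions that the listing does not confirm, and the sample records are placeholders rather than real corpus content.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Placeholder records, NOT real corpus content. The keys "id" and
# "sentence" are assumed; check the actual files for the real names.
sample = io.StringIO(
    '{"id": 0, "sentence": "placeholder sentence one"}\n'
    '\n'
    '{"id": 1, "sentence": "placeholder sentence two"}\n'
)

records = list(read_jsonl(sample))
sentences = [r["sentence"] for r in records]
```

For a downloaded split, the same function should work on a file object opened with `open(path, encoding="utf-8")`, where `path` points to whichever JSONL file the repository actually provides.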

Provider:

DanMM96
