sagepond/lug_web

Source: Hugging Face. Updated 2025-07-18. Indexed 2025-11-02.
Download link:
https://hf-mirror.com/datasets/sagepond/lug_web
Description:
The Luganda Pretraining Dataset is a continuously growing corpus of Luganda text, curated by SAGE POND. Luganda is one of Uganda’s major Bantu languages. The dataset is designed to support language model pretraining, text generation, and other Luganda NLP research tasks. Because Luganda is a low-resource language in NLP, the dataset aims to capture its linguistic richness, cultural expressions, and dialectal diversity. It spans multiple domains, including oral literature, social media, news, religious texts, and academic materials. It will be updated continuously to add dialectal and regional variations; contemporary colloquialisms and emerging terms; proverbs, idioms, and culturally significant expressions; and domain-specific corpora such as legal, health, and finance.
Provider:
sagepond
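
For readers who want to pull the corpus for pretraining, the following is a minimal sketch of loading it through the mirror listed above with the Hugging Face `datasets` library. It is not taken from the dataset's own documentation. The helper names `dataset_url` and `load_lug_web` are hypothetical, and the split name `train` is an assumption that should be checked against the dataset page.

```python
import os

MIRROR = "https://hf-mirror.com"   # mirror host from the download link
REPO_ID = "sagepond/lug_web"       # dataset repository id from this listing


def dataset_url(repo_id: str, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given endpoint (hypothetical helper)."""
    return f"{endpoint}/datasets/{repo_id}"


def load_lug_web(split: str = "train", streaming: bool = True):
    """Load the corpus via the mirror (hypothetical helper).

    Assumptions: the `datasets` package is installed, and a split named
    "train" exists. Streaming avoids downloading the whole corpus up front.
    """
    # HF_ENDPOINT must be set before the Hugging Face libraries are imported.
    os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


if __name__ == "__main__":
    print(dataset_url(REPO_ID))
```

Calling `load_lug_web()` returns an iterable dataset when `streaming=True`. The column names are not stated in this listing, so inspect the first record before relying on a particular field.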