
The Swahili Digraph Corpus

Mendeley Data, indexed 2026-04-09
Download link:
https://data.mendeley.com/datasets/pttfc9cyrt
Description:
The Swahili Digraph Corpus is a dataset built to capture the phonetic elements of the Swahili language and to serve as a resource for natural language processing (NLP) and machine learning research. It covers the Swahili digraphs “ch,” “dh,” “gh,” “kh,” “ng’,” “ny,” “sh,” “th,” and “ng,” which are essential for representing Swahili phonetics accurately. Each digraph’s frequency is annotated across the vowels “a,” “e,” “i,” “o,” and “u,” giving a foundation for model training, testing, and validation. The distribution, which includes 9,483 instances of “ch” and 11,604 instances of “ng,” is intended to help machine learning models generalize across vowel contexts, which matters for robust digraph recognition. The corpus comprises 31,197 annotated words and also includes rare digraphs such as “kh” and “ng’,” so that models can learn both common and less frequent Swahili sounds. By bringing together a wide range of Swahili phonetic patterns, the corpus supports the development of precise, context-sensitive Swahili language processing models.
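To make the idea of "digraph frequency across vowels" concrete, here is a minimal Python sketch that counts digraph-plus-vowel pairs in a list of words. It is an illustration only: the corpus file format is not described here, so the sketch works on a small hand-written word list, and the names `DIGRAPHS`, `digraph_vowel_counts`, and `sample_words` are hypothetical rather than part of the dataset.

```python
import re
from collections import Counter

# Digraphs named in the description. "ng'" comes first so that the longer
# form is tried before plain "ng" at the same position.
DIGRAPHS = ["ng'", "ch", "dh", "gh", "kh", "ny", "sh", "th", "ng"]
VOWELS = "aeiou"

# A digraph immediately followed by one of the five vowels.
PATTERN = re.compile(
    "(" + "|".join(re.escape(d) for d in DIGRAPHS) + ")([" + VOWELS + "])"
)


def digraph_vowel_counts(words):
    """Count (digraph, vowel) pairs across an iterable of words."""
    counts = Counter()
    for word in words:
        # Lowercase and map the typographic apostrophe to a plain one,
        # so "ng’" and "ng'" are treated as the same digraph.
        normalized = word.lower().replace("\u2019", "'")
        for digraph, vowel in PATTERN.findall(normalized):
            counts[(digraph, vowel)] += 1
    return counts


# Hypothetical sample words, not taken from the corpus itself.
sample_words = ["chakula", "ng'ombe", "nyumba", "shule", "ngoma", "dhahabu"]
counts = digraph_vowel_counts(sample_words)
for (digraph, vowel), n in sorted(counts.items()):
    print(f"{digraph} + {vowel}: {n}")
```

Listing “ng’” ahead of “ng” in the pattern keeps the two digraphs from being conflated, which matters because the description counts them separately.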
Provided by:
Murang'a University of Technology