The Swahili Digraph Corpus
Source: Mendeley Data. Indexed 2026-04-09.
Download link:
https://data.mendeley.com/datasets/pttfc9cyrt
Description:
The Swahili Digraph Corpus is a dataset built to capture the phonetic elements of the Swahili language, intended as a resource for natural language processing (NLP) and machine learning research. It covers the Swahili digraphs “ch,” “dh,” “gh,” “kh,” “ng’,” “ny,” “sh,” “th,” and “ng,” which are essential for representing Swahili pronunciation accurately. Each digraph is annotated with its frequency before the vowels “a,” “e,” “i,” “o,” and “u,” giving a foundation for model training, testing, and validation. The distribution across vowels, which includes 9,483 instances of “ch” and 11,604 instances of “ng,” is intended to help machine learning models generalize across vowel contexts, a requirement for robust digraph recognition. The corpus comprises 31,197 annotated words and also includes rare digraphs such as “kh” and “ng’,” so models can learn both common and less frequent Swahili sounds. By covering this range of phonetic patterns, the corpus supports the development of precise, context-sensitive Swahili language processing models.
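The per-vowel frequency annotation described above can be illustrated with a short sketch. This is not the corpus authors' tooling: the function name `digraph_vowel_counts`, the sample words, and the plain string-matching approach are all assumptions made for illustration, and the corpus's actual file format is not stated in this listing.

```python
from collections import Counter

# Digraphs named in the corpus description. "ng'" is listed first so the
# longer form wins over plain "ng" when both match at the same position.
DIGRAPHS = ["ng'", "ch", "dh", "gh", "kh", "ny", "sh", "th", "ng"]
VOWELS = "aeiou"


def digraph_vowel_counts(words):
    """Count (digraph, following vowel) pairs across an iterable of words.

    Hypothetical helper: a purely orthographic match, with no
    morphological or phonological analysis.
    """
    counts = Counter()
    for word in words:
        # Lowercase and normalize a curly apostrophe to a straight one.
        w = word.lower().replace("\u2019", "'")
        i = 0
        while i < len(w):
            for d in DIGRAPHS:
                if w.startswith(d, i):
                    nxt = w[i + len(d): i + len(d) + 1]
                    if nxt and nxt in VOWELS:
                        counts[(d, nxt)] += 1
                    i += len(d)  # skip past the matched digraph
                    break
            else:
                i += 1  # no digraph starts here; advance one character
    return counts


# Illustrative words only, not taken from the corpus.
sample = ["chai", "ng'ombe", "ngoma", "nyumba", "shule", "thamani"]
counts = digraph_vowel_counts(sample)
print(counts[("ng'", "o")], counts[("ng", "o")])  # prints "1 1"
```

Checking “ng’” before “ng” keeps the two digraphs apart, which matters because the description lists them as separate entries.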
Provider:
Murang'a University of Technology



