The Swahili Digraph Corpus

Name: The Swahili Digraph Corpus
Creator: Mendeley Data
License: No description available

Indexed from doi.org on 2025-03-26

Download link:

http://doi.org/10.17632/pttfc9cyrt.2


Description:

The Swahili Digraph Corpus is a dataset built to capture the phonetic elements of the Swahili language as a resource for natural language processing (NLP) and machine learning research. It covers the Swahili digraphs “ch,” “dh,” “gh,” “kh,” “ng’,” “ny,” “sh,” “th,” and “ng,” which are essential for representing Swahili phonetics accurately. Each digraph's frequency is annotated across the vowels “a,” “e,” “i,” “o,” and “u,” giving a foundation for model training, testing, and validation. The distribution, which includes 9,483 instances of “ch” and 11,604 instances of “ng,” is meant to help machine learning models generalize across vowel contexts, which is essential for robust digraph recognition. The corpus comprises 31,197 annotated words and also includes rare digraphs such as “kh” and “ng’,” so models can learn both common and less frequent Swahili sounds. By covering this range of phonetic patterns, the corpus supports the development of precise, context-sensitive Swahili language processing models.
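The per-vowel frequency annotation described above can be illustrated with a minimal Python sketch. It makes several assumptions: the corpus file layout is not described here, so the input is simply an iterable of words; the function name `count_digraph_vowel_pairs` is hypothetical; and the rule of counting a digraph only when a vowel follows it directly is one plausible reading of "frequency across the vowels", not necessarily the corpus authors' method.

```python
from collections import Counter

# Digraphs named in the corpus description. "ng'" is listed first so that it
# is matched before the shorter "ng".
DIGRAPHS = ["ng'", "ch", "dh", "gh", "kh", "ny", "sh", "th", "ng"]
VOWELS = "aeiou"


def count_digraph_vowel_pairs(words):
    """Count (digraph, following vowel) pairs across an iterable of words.

    Hypothetical helper: assumes a digraph is counted only when it is
    immediately followed by one of the five vowels.
    """
    counts = Counter()
    for word in words:
        # Lowercase and normalise curly apostrophes so "ng’" matches "ng'".
        w = word.lower().replace("’", "'")
        i = 0
        while i < len(w):
            for digraph in DIGRAPHS:
                if w.startswith(digraph, i):
                    end = i + len(digraph)
                    following = w[end:end + 1]
                    if following and following in VOWELS:
                        counts[(digraph, following)] += 1
                    i = end  # skip past the matched digraph
                    break
            else:
                i += 1  # no digraph starts here; move one character on
    return counts


if __name__ == "__main__":
    sample = ["chakula", "ng'ombe", "ngoma", "shule"]
    for (digraph, vowel), n in sorted(count_digraph_vowel_pairs(sample).items()):
        print(digraph, vowel, n)
```

Matching “ng’” before “ng” keeps the two digraphs in separate counts, which matters because the description treats them as distinct entries.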


Provider:

Mendeley Data
