Data for the Journal Paper Machine Learning-Based Context-Aware Lemmatization for Low-Resource Languages: A Case Study of Setswana
Source: Figshare. Updated: 2025-03-02. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Data_for_the_Journal_Paper_Machine_Learning-Based_Context-Aware_Lemmatization_for_Low-Resource_Languages_A_Case_Study_of_Setswana/28519496
Description:
Efficient natural language processing (NLP) tools for Setswana are essential for improving human-machine interaction, yet the language remains underrepresented in computational linguistics due to its complex morphology and limited linguistic resources. This study introduces a context-aware machine-learning-based lemmatization model for Setswana, addressing challenges in word sense disambiguation and morphological analysis. Unlike previous rule-based lemmatizers, which process words in isolation, this model incorporates contextual information using Naïve Bayes (NB) and N-gram embeddings to improve lemma prediction accuracy. The proposed model was trained and evaluated using a manually annotated Setswana corpus, integrating part-of-speech (POS) tagging and named entity recognition (NER) as key linguistic features. Performance evaluation, based on accuracy (70.32%), precision (70%), recall (65%), and F1-score (66%), demonstrates the model’s effectiveness in resolving polysemous words, a challenge not addressed by existing Setswana lemmatization approaches. Comparative analysis with prior studies highlights that machine-learning models outperform rule-based approaches in capturing contextual dependencies, although dataset size and feature selection remain critical to performance improvement. This research marks a significant advancement in Setswana NLP, establishing a foundation for future hybrid models that integrate deep learning and rule-based techniques for enhanced accuracy. The study contributes to the development of computational tools for low-resource languages, paving the way for their inclusion in modern information retrieval, machine translation, and conversational AI systems.
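To make the idea of context-aware lemma prediction concrete, the following is a minimal sketch of a Naïve Bayes classifier over contextual features (the token, its POS tag, its neighbours, and a word bigram). It is not the authors' implementation: the feature set, the add-one smoothing, the class name `NaiveBayesLemmatizer`, the helper `context_features`, and the English toy sentences are all assumptions chosen for illustration, and none of the data comes from the Setswana corpus.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, pos_tags, i):
    """Hypothetical feature set for token i: word, POS tag, neighbours, bigram."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [
        f"w={tokens[i]}",
        f"pos={pos_tags[i]}",
        f"prev={prev_tok}",
        f"next={next_tok}",
        f"bigram={prev_tok}_{tokens[i]}",
    ]


class NaiveBayesLemmatizer:
    """Multinomial Naive Bayes that treats each lemma as a class (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive smoothing constant (assumed)
        self.lemma_counts = Counter()           # how often each lemma was seen
        self.feat_counts = defaultdict(Counter) # per-lemma feature counts
        self.vocab = set()                      # all distinct features seen in training

    def fit(self, examples):
        for feats, lemma in examples:
            self.lemma_counts[lemma] += 1
            for f in feats:
                self.feat_counts[lemma][f] += 1
                self.vocab.add(f)

    def predict(self, feats):
        total = sum(self.lemma_counts.values())
        best, best_score = None, -math.inf
        for lemma, n in self.lemma_counts.items():
            # log prior plus smoothed log likelihood of every feature
            score = math.log(n / total)
            denom = sum(self.feat_counts[lemma].values()) + self.alpha * len(self.vocab)
            for f in feats:
                score += math.log((self.feat_counts[lemma][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = lemma, score
        return best


# Toy English data (NOT from the dataset): "saw" is ambiguous between see / saw.
corpus = [
    (["I", "saw", "the", "dog"], ["PRON", "VERB", "DET", "NOUN"], ["I", "see", "the", "dog"]),
    (["the", "saw", "is", "sharp"], ["DET", "NOUN", "VERB", "ADJ"], ["the", "saw", "be", "sharp"]),
    (["she", "saw", "a", "bird"], ["PRON", "VERB", "DET", "NOUN"], ["she", "see", "a", "bird"]),
    (["a", "saw", "cuts", "wood"], ["DET", "NOUN", "VERB", "NOUN"], ["a", "saw", "cut", "wood"]),
]

examples = [
    (context_features(tokens, tags, i), lemmas[i])
    for tokens, tags, lemmas in corpus
    for i in range(len(tokens))
]

model = NaiveBayesLemmatizer()
model.fit(examples)

verb_case = context_features(["we", "saw", "the", "cat"], ["PRON", "VERB", "DET", "NOUN"], 1)
noun_case = context_features(["that", "saw", "is", "old"], ["DET", "NOUN", "VERB", "ADJ"], 1)
print(model.predict(verb_case), model.predict(noun_case))
```

On this toy data the two predictions should be `see` and `saw` respectively, which illustrates how the same surface form can map to different lemmas once the POS tag and neighbouring words are taken into account. A rule-based lemmatizer that processes the word in isolation has no way to make that distinction.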
Created:
2025-03-02