KenLumachiQuAD - A QA dataset for Kenyan Luhya Lumarachi dialect

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://data.mendeley.com/datasets/b6bybwnpxh

Description:

KenLumachiQuAD is the result of a project that annotated 1,000 QA pairs based on 137 texts in the Lumarachi dialect of Luhya, a Kenyan language. The source texts come from the data collected by the Kenyan languages corpus project, Kencorpus (https://kencorpus.maseno.ac.ke/corpus-datasets/) [1]. Kencorpus holds 483 Luhya Lumarachi texts in total; each of the 137 selected texts was annotated with at least 5 QA pairs.

The KenLumachiQuAD QA dataset is available for download as a single CSV file. Each row gives the reference number of the source text and one QA pair for that text. The updated version converts all text to lowercase for ease of processing. The CSV file has three columns: ‘Story_ID’ identifies the Kencorpus source text from which the QA pair is derived, ‘Q’ contains the question text, and ‘A’ contains the answer text.

This is a gold-standard dataset annotated by human annotators who are native speakers of the language. It was built using the same modalities and quality assurance checks as a similar project for the low-resource language Kiswahili [2].

The dataset is useful for testing machine learning QA systems for the low-resource language Luhya, specifically the Lumarachi dialect, which is predominantly spoken in Western Kenya. A semantic network approach to the QA task, previously applied to Kiswahili [2], is currently being tested on this dataset to confirm whether such an approach is applicable where there is too little training data (source texts) to train deep learning systems.

[1] Wanjawa, B., Wanzare, L., Indede, F., McOnyango, O., Ombui, E., & Muchemi, L. (2023). Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks. Journal for Language Technology and Computational Linguistics, 36(2), 1–27.

[2] Wanjawa, B. W., Wanzare, L. D. A., Indede, F., McOnyango, O., Muchemi, L., & Ombui, E. (2023). KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(4), 1–20.
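
The CSV layout described above (one QA pair per row, with columns ‘Story_ID’, ‘Q’ and ‘A’) could be read along the following lines. This is only a minimal sketch using Python's standard csv module. The function names are hypothetical, and the UTF-8 encoding and the presence of a header row carrying exactly those column names are assumptions that are not confirmed by the dataset description.

```python
import csv
from collections import defaultdict


def group_qa_by_story(csv_lines):
    """Group (question, answer) pairs by Story_ID.

    `csv_lines` is any iterable of CSV text lines (for example an open
    file). Assumes a header row with the columns Story_ID, Q and A.
    """
    pairs_by_story = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        pairs_by_story[row["Story_ID"]].append((row["Q"], row["A"]))
    return dict(pairs_by_story)


def stories_below_minimum(pairs_by_story, minimum=5):
    """Return the Story_IDs that have fewer than `minimum` QA pairs."""
    return sorted(
        story_id
        for story_id, pairs in pairs_by_story.items()
        if len(pairs) < minimum
    )


def load_qa_pairs(path):
    """Read the CSV at `path` (assumed UTF-8) and group its QA pairs."""
    with open(path, newline="", encoding="utf-8") as handle:
        return group_qa_by_story(handle)
```

With the downloaded file saved locally under a name of your choosing, `load_qa_pairs(path)` returns a dictionary keyed by source text. Passing that dictionary to `stories_below_minimum` then gives a quick check of the stated minimum of 5 QA pairs per text.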

Created:

2025-06-10
