KenLumachiQuAD - A QA dataset for Kenyan Luhya Lumarachi dialect

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/b6bybwnpxh
Resource description:
KenLumachiQuAD is the result of a project that annotated 1,000 QA pairs based on 137 texts in the Lumarachi dialect of Luhya, a Kenyan language. The source texts come from the text data collected by the Kenyan languages corpus project, Kencorpus (https://kencorpus.maseno.ac.ke/corpus-datasets/) [1]. Kencorpus contains 483 Luhya Lumarachi texts in total; we annotated each of the 137 selected texts with at least 5 QA pairs.

The KenLumachiQuAD dataset is available for download as a single CSV file. Each row gives the reference number of the source text and one QA pair for that text. The updated version converts all text to lowercase for ease of processing. The CSV file has three columns: ‘Story_ID’ identifies the Kencorpus source text from which the QA pair is derived, ‘Q’ contains the question text, and ‘A’ contains the answer text.

This is a gold-standard dataset annotated by human annotators who are native speakers of the language. It was developed using the same methods and quality assurance checks as a similar project for Kiswahili, another low-resource language [2]. The dataset is useful for testing machine learning QA systems for the low-resource language of Luhya, specifically the Lumarachi dialect, which is predominantly spoken in Western Kenya. A semantic network approach to the QA task, previously applied to Kiswahili [2], is currently being tested on this dataset to determine whether it is applicable in cases where there is too little training data (source texts) to train deep learning systems.

[1] Wanjawa, B., Wanzare, L., Indede, F., McOnyango, O., Ombui, E., & Muchemi, L. (2023). Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks. Journal for Language Technology and Computational Linguistics, 36(2), 1–27.
[2] Wanjawa, B. W., Wanzare, L. D. A., Indede, F., McOnyango, O., Muchemi, L., & Ombui, E. (2023). KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(4), 1–20.
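
The CSV layout described above (one QA pair per row, with columns ‘Story_ID’, ‘Q’ and ‘A’) can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset release: the function names `load_qa_pairs` and `stories_below_minimum` are hypothetical, and the sample rows are placeholders rather than real dataset content. It assumes the file has a header row with exactly those column names.

```python
import csv
import io
from collections import defaultdict


def load_qa_pairs(csv_file):
    """Group (question, answer) pairs by Story_ID from an open CSV file object."""
    grouped = defaultdict(list)
    # Assumes a header row naming the columns Story_ID, Q and A.
    for row in csv.DictReader(csv_file):
        grouped[row["Story_ID"]].append((row["Q"], row["A"]))
    return dict(grouped)


def stories_below_minimum(grouped, minimum=5):
    """Return the Story_IDs that have fewer than `minimum` QA pairs."""
    return sorted(sid for sid, pairs in grouped.items() if len(pairs) < minimum)


# Placeholder rows for illustration only; these are NOT real dataset content.
SAMPLE_CSV = """Story_ID,Q,A
story_a,placeholder question 1,placeholder answer 1
story_a,placeholder question 2,placeholder answer 2
story_b,placeholder question 3,placeholder answer 3
"""

sample = load_qa_pairs(io.StringIO(SAMPLE_CSV))
print(len(sample["story_a"]))         # number of QA pairs for story_a
print(stories_below_minimum(sample))  # texts with fewer than 5 QA pairs
```

To run this against the downloaded file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. On the full dataset, `stories_below_minimum` should return an empty list if every text has at least 5 QA pairs as described.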
Created:
2025-06-10