KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language

DataONE | Updated 2023-11-21 | Indexed 2024-06-08

Download link:

https://search.dataone.org/view/sha256:c5ac5737352417d3d82bc011df048d1bfc608322817799ceda9e7d4da4424b24

Description:

This research developed the Kencorpus Swahili Question Answering Dataset (KenSwQuAD) from raw Swahili data. Swahili is a low-resource language spoken predominantly in Eastern Africa, with speakers in other parts of the world. Question answering datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Before machine learning systems can perform these tasks, however, they need training data such as the gold standard Question Answering (QA) set developed in this research. The research engaged annotators to formulate question-answer pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus that collected data from three Kenyan languages. The total Swahili collection comprised 2,585 texts, of which we annotated 1,445 story texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts was re-evaluated by different annotators, who confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to machine learning on the question answering task confirmed that the dataset can be used for such practical tasks. The research therefore developed KenSwQuAD, a question-answer dataset for Swahili that is useful to natural language processing practitioners who need training and gold standard sets for their machine learning applications. The research also contributed to the resourcing of the Swahili language, which is important for communication around the globe. Updating this set and building similar sets for other low-resource languages is an important area for further research.
Acknowledgement of annotators: Rose Felynix Nyaboke, Alice Gachachi Muchemi, Patrick Ndung'u, Eric Omundi Magutu, Henry Masinde, Naomi Muthoni Gitau, Mark Bwire Erusmo, Victor Orembe Wandera, Frankline Owino, Geoffrey Sagwe Ombui
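
The figures quoted in the description can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the derived values (share of texts annotated, mean QA pairs per text, approximate size of the quality assurance sample) are computed here from the quoted figures rather than reported by the dataset authors.

```python
# Figures quoted in the dataset description.
TOTAL_SWAHILI_TEXTS = 2585   # texts in the Swahili collection
ANNOTATED_TEXTS = 1445       # story texts that were annotated
QA_PAIRS = 7526              # question-answer pairs in the final dataset
MIN_PAIRS_PER_TEXT = 5       # "at least 5 QA pairs each"
QA_SAMPLE_FRACTION = 0.125   # share of annotated texts re-evaluated


def summarize() -> dict:
    """Derive summary statistics from the quoted figures (not from the data itself)."""
    return {
        # Fraction of the Swahili collection that was annotated.
        "annotated_share": ANNOTATED_TEXTS / TOTAL_SWAHILI_TEXTS,
        # Average number of QA pairs per annotated text.
        "mean_pairs_per_text": QA_PAIRS / ANNOTATED_TEXTS,
        # Lower bound on the pair count implied by the per-text minimum.
        "min_possible_pairs": ANNOTATED_TEXTS * MIN_PAIRS_PER_TEXT,
        # 12.5% of the annotated texts; the description does not say how
        # this was rounded, so the value is left unrounded.
        "quality_sample_texts": ANNOTATED_TEXTS * QA_SAMPLE_FRACTION,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

The reported total of 7,526 pairs sits above the lower bound of 7,225 implied by the five-pair minimum, so the quoted figures are mutually consistent.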

Created:

2023-12-16
