SpokenNativQA
Source: arXiv. Indexed 2025-09-30.
Download link:
https://huggingface.co/datasets/QCRI/SpokenNativQA
Description:
SpokenNativQA is a multilingual spoken question-answering dataset of approximately 33,000 naturally spoken questions and answers, totaling about 30 hours of audio. It is designed to evaluate large language models (LLMs) in real-world conversational settings, accounting for speech variability, accents, and linguistic diversity. The recordings come from native Arabic speakers and fluent English speakers, with each question recorded by ten speakers. The dataset also includes Arabic and English test sets. Its target task is spoken question answering (Spoken QA).
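As a rough sanity check on the figures above, the quoted totals imply a short average clip length. The sketch below only does that arithmetic. It assumes that the roughly 33,000 samples are individual audio recordings and that the roughly 30 hours covers all of them, neither of which the description states outright. The function and constant names are illustrative and not part of the dataset.

```python
# Figures quoted in the description; both are approximate.
TOTAL_HOURS = 30
NUM_SAMPLES = 33_000


def average_clip_seconds(total_hours: float, num_samples: int) -> float:
    """Mean audio length per sample, in seconds."""
    return total_hours * 3600 / num_samples


# Assumption: every sample is one recording and all audio is counted in the total.
avg_seconds = average_clip_seconds(TOTAL_HOURS, NUM_SAMPLES)
print(f"about {avg_seconds:.1f} s per sample")
```

Under those assumptions the average recording is only a few seconds long, which is consistent with short spoken questions rather than long-form speech.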
Provided by:
Qatar Computing Research Institute (QCRI)



