SpokenNativQA

Name: SpokenNativQA
Creator: maas
Published: 2025-12-05 11:59:25
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-06-21

Download link:

https://modelscope.cn/datasets/QCRI/SpokenNativQA

Description:

# SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

The **SpokenNativQA** dataset consists of question-answer (QA) pairs in which the queries come from real users and the answers are manually reviewed and edited. The dataset covers 18 topics that reflect culturally and regionally specific knowledge as well as everyday queries: animals, business, clothing, education, events, food and drinks, general knowledge, geography, immigration, language, literature, names and persons, plants, religion, sports and games, tradition, travel, and weather.

SpokenNativQA provides **multilingual test sets of everyday spoken questions** for evaluating large language models (LLMs) and speech processing systems. The dataset contains **Arabic** and **English** queries, each transcribed by multiple automatic speech recognition (ASR) systems.

<img src="./spokennativqa.png" style="width: 60%;" id="title-icon">

**Note:** This repository shares only the WAV files used for evaluation. The full dataset reported in the paper may be available on request from the authors.

## Directory Overview

The dataset is organized into two main directories:

- **`arabic_qa/`**
  - `spokenqa_arabic_qa_test_azure_asr.jsonl`
  - `spokenqa_arabic_qa_test_fanar_asr.jsonl`
  - `spokenqa_arabic_qa_test_google_asr.jsonl`
  - `spokenqa_arabic_qa_test_whisper_asr.jsonl`
  - `spokenqa_arabic_qa_test.jsonl`
  - `speech/`: WAV files
- **`english_qa/`**
  - `spokenqa_english_qa_test_azure_asr.jsonl`
  - `spokenqa_english_qa_test_fanar_asr.jsonl`
  - `spokenqa_english_qa_test_google_asr.jsonl`
  - `spokenqa_english_qa_test.jsonl`
  - `speech/`: WAV files

### Dataset Structure and Format

Each `.jsonl` file contains one JSON object per line. The typical structure includes:

- **`lang`**: The language of the spoken query (e.g., "arabic", "english").
- **`data_id`**: A unique identifier for the data instance.
- **`file_name`**: The name of the audio file.
- **`file_path`**: The relative path of the audio file.
- **`question`**: The intended question in text form (reference).
- **`answer`**: The expected (reference) answer to the question.
- **`location`**: The geographical location where the query was recorded.
- **`asr_text`**: The text output from the ASR system.

### Example of a JSON Entry

```json
{
  "lang": "arabic",
  "data_id": "3cdfcfd1acb722617ec8bbe6808114bc",
  "file_name": "3cdfcfd1acb722617ec8bbe6808114bc_1724222501325.wav",
  "file_path": "speech/3cdfcfd1acb722617ec8bbe6808114bc_1724222501325.wav",
  "question": "من هو الشاعر الذي سجن؟",
  "answer": "وهي القصائد التي كتبها أبو فراس الحمداني فترة أسره عند الروم في سجن خرشنة، وعرفت باسم الروميات نسبة لمكان أسره، وقد تميزت هذا القصائد بجزالتها وقوتها ورصانتها، وصدق عاطفتها.",
  "location": "qatar",
  "asr_text": "من هو الشاعر الذي زعل؟"
}
```

# Experimental Scripts:

All experimental scripts are available as part of the LLMeBench framework: [https://github.com/qcri/LLMeBench](https://github.com/qcri/LLMeBench).

# License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

# Citation

If you use this dataset in your research, we kindly ask that you cite our [paper](https://www.arxiv.org/pdf/2505.19163).

```
@inproceedings{alam2025spokennativqa,
  title = {SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs},
  author = {Firoj Alam and Md Arid Hasan and Shammur Absar Chowdhury},
  booktitle = {Proceedings of the 26th Interspeech Conference (Interspeech 2025)},
  year = {2025},
  address = {Rotterdam, The Netherlands},
  month = aug,
  organization = {ISCA},
}
```
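
# Example: Reading the JSONL Files

The JSONL layout described under "Dataset Structure and Format" can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset's official tooling: the helper names (`parse_jsonl`, `load_jsonl`, `word_edit_distance`, `word_error_rate`) are hypothetical, and comparing `asr_text` with `question` as a word error rate is an assumption about how the two fields might be used. No text normalization is applied, so punctuation attached to a word counts as part of that word, and records lacking either field are skipped.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def load_jsonl(path):
    """Read a UTF-8 .jsonl file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return parse_jsonl(handle)


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance over whitespace-separated tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_error_rate(records):
    """Corpus-level WER of `asr_text` against `question` (assumed reference)."""
    errors = total = 0
    for record in records:
        if "asr_text" not in record or "question" not in record:
            continue  # skip records without an ASR transcript or reference
        errors += word_edit_distance(record["question"], record["asr_text"])
        total += len(record["question"].split())
    return errors / total if total else 0.0


# Usage with the two text fields from the example entry in the dataset card.
sample = json.dumps({
    "lang": "arabic",
    "question": "من هو الشاعر الذي سجن؟",
    "asr_text": "من هو الشاعر الذي زعل؟",
})
print(word_error_rate(parse_jsonl([sample])))  # 0.2: one of five words differs
```

To process a file from the repository, pass its path to `load_jsonl`, for example `load_jsonl("arabic_qa/spokenqa_arabic_qa_test_whisper_asr.jsonl")`, assuming the directory layout listed in the overview has been downloaded locally.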


Provider:

maas

Created:

2025-06-17
