
NMSQA

Source: ModelScope community. Updated 2025-11-02; indexed 2025-03-01.
Download link: https://modelscope.cn/datasets/voidful/NMSQA
# Dataset Card for NMSQA (Natural Multi-speaker Spoken Question Answering)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- Homepage: https://github.com/DanielLin94144/DUAL-textless-SQA
- Repository: https://github.com/DanielLin94144/DUAL-textless-SQA
- Paper: https://arxiv.org/abs/2203.04911
- Leaderboard:
- Point of Contact:

Download audio data: [https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz](https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz)

Unzip audio data: `tar -xf nmsqa_audio.tar.gz`

### Dataset Summary

The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset is designed for the task of textless spoken question answering. It is based on the SQuAD dataset and contains spoken questions and passages. The dataset includes the original text, transcriptions, and audio files of the spoken content.
This dataset was created to evaluate model performance on textless spoken question answering.

### Supported Tasks and Leaderboards

The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

Each instance in the dataset contains the following fields:

- id: Unique identifier for the instance
- title: The title of the passage
- context: The passage text
- question: The question text
- answers:
  - answer_start: The start index of the answer in the text
  - audio_full_answer_end: The end position of the answer in the full audio, in seconds
  - audio_full_answer_start: The start position of the answer in the full audio, in seconds
  - audio_full_neg_answer_end: The end position, in seconds, of an incorrect answer span that contains the same words
  - audio_full_neg_answer_start: The start position, in seconds, of an incorrect answer span that contains the same words
  - audio_segment_answer_end: The end position of the answer within the audio segment, in seconds
  - audio_segment_answer_start: The start position of the answer within the audio segment, in seconds
  - text: The answer text
- content_segment_audio_path: The audio path for the content segment
- content_full_audio_path: The audio path for the complete content
- content_audio_sampling_rate: The sampling rate of the content audio
- content_audio_speaker: The speaker of the content audio
- content_segment_text: The text of the content segment
- content_segment_normalized_text: The normalized segment text used to generate the audio
- question_audio_path: The audio path for the question
- question_audio_sampling_rate: The sampling rate of the question audio
- question_audio_speaker: The speaker of the question audio
- question_normalized_text: The normalized question text used to generate the audio

### Data Fields

The dataset includes the following data fields:

- id
- title
- context
- question
- answers
- content_segment_audio_path
- content_full_audio_path
- content_audio_sampling_rate
- content_audio_speaker
- content_segment_text
- content_segment_normalized_text
- question_audio_path
- question_audio_sampling_rate
- question_audio_speaker
- question_normalized_text

### Data Splits

The dataset is split into train, dev, and test sets.

## Dataset Creation

### Curation Rationale

The NMSQA dataset was created to address the challenge of textless spoken question answering, in which a model must answer questions based on spoken passages without relying on textual information.

### Source Data

The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.

#### Initial Data Collection and Normalization

The initial data collection involved converting the original SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and then audio files were generated using text-to-speech methods.

#### Who are the source language producers?

The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.

### Annotations

#### Annotation process

The annotations for the NMSQA dataset are derived from the original SQuAD dataset. The dataset creators added further annotations: audio start and end positions for correct and incorrect answers, audio file paths, and speaker information.

#### Who are the annotators?

The annotators for the NMSQA dataset are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and additional annotations for the NMSQA dataset.

### Personal and Sensitive Information

The dataset does not contain any personal or sensitive information.
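
The timing annotations described under Annotations give answer positions in seconds, so using them on a decoded waveform means converting seconds to sample indices. The following is a minimal sketch of that conversion, not the dataset authors' code: the helper names `answer_span_samples` and `slice_answer`, the 16 kHz sampling rate, and the example timestamps are all assumptions made for illustration.

```python
def answer_span_samples(start_s, end_s, sampling_rate):
    """Convert an answer span given in seconds to (start, end) sample indices."""
    if start_s < 0 or end_s < start_s:
        raise ValueError("expected 0 <= start_s <= end_s")
    # Round to the nearest sample so float error does not shift a boundary.
    start = int(round(start_s * sampling_rate))
    end = int(round(end_s * sampling_rate))
    return start, end


def slice_answer(waveform, start_s, end_s, sampling_rate):
    """Return the samples of `waveform` that fall inside the answer span."""
    start, end = answer_span_samples(start_s, end_s, sampling_rate)
    return waveform[start:end]


# Hypothetical values: a 3-second silent clip at an assumed 16 kHz rate,
# with an answer annotated from 1.5 s to 2.0 s.
sampling_rate = 16000
waveform = [0.0] * (3 * sampling_rate)
answer_audio = slice_answer(waveform, 1.5, 2.0, sampling_rate)
print(len(answer_audio) / sampling_rate)  # duration of the slice in seconds
```

With real data, the start and end values would come from the answer timing fields and the rate from content_audio_sampling_rate. Judging by the field names, the audio_full_* positions presumably refer to the file at content_full_audio_path and the audio_segment_* positions to the file at content_segment_audio_path; the card does not state this explicitly.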
## Considerations for Using the Data

### Social Impact of Dataset

The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering, which can lead to advances in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in domains such as virtual assistants, customer service, and voice-controlled devices.

### Discussion of Biases

The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Additional biases may be introduced by the text-to-speech process and by the choice of speakers used to generate the spoken audio files.

### Other Known Limitations

Because the dataset is based on SQuAD, it shares the same limitations: it is limited to English and focuses mainly on factual questions. Furthermore, the dataset may not cover a wide range of accents, dialects, or speaking styles.

## Additional Information

### Dataset Curators

The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, and Lin-shan Lee.

### Licensing Information

No license is explicitly stated for the dataset.

### Citation Information

```bibtex
@article{lin2022dual,
  title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
  author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
  journal={arXiv preprint arXiv:2203.04911},
  year={2022}
}
```

### Contributions

Thanks to [@voidful](https://github.com/voidful) for adding this dataset.
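
For readers who prefer to script the download and unzip steps listed in the Dataset Description, the following is a rough standard-library sketch. The URL and archive name are taken from this card; the function names `download_audio` and `extract_audio` and the default paths are assumptions for illustration, and the archive's internal layout is not documented here.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as given in the Dataset Description of this card.
AUDIO_URL = "https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz"


def download_audio(url=AUDIO_URL, archive_path="nmsqa_audio.tar.gz"):
    """Download the audio archive unless a local copy already exists."""
    archive = Path(archive_path)
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return archive


def extract_audio(archive_path, dest_dir="."):
    """Unpack the archive, the equivalent of `tar -xf nmsqa_audio.tar.gz`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r" (the default) detects gzip compression automatically.
    with tarfile.open(str(archive_path)) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            tar.extractall(str(dest), filter="data")
        else:
            tar.extractall(str(dest))
    return dest
```

Calling `extract_audio(download_audio())` performs both steps in the current directory. Nothing runs at import time, so the functions can be reused from a larger preparation script.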

Provider: maas
Created: 2025-03-01