voidful/NMSQA

Name: voidful/NMSQA
Creator: voidful
Published: 2023-04-04 04:46:23
License: Not specified

Source: Hugging Face. Updated 2023-04-04. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/voidful/NMSQA

Resource description:

---
annotations_creators:
- crowdsourced
- machine-generated
language_creators:
- expert-generated
- machine-generated
- crowdsourced
language:
- en
license: []
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- question-answering
- automatic-speech-recognition
task_ids:
- abstractive-qa
pretty_name: NMSQA
tags:
- speech-recognition
---

# Dataset Card for NMSQA (Natural Multi-speaker Spoken Question Answering)

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- Homepage: https://github.com/DanielLin94144/DUAL-textless-SQA
- Repository: https://github.com/DanielLin94144/DUAL-textless-SQA
- Paper: https://arxiv.org/abs/2203.04911
- Leaderboard:
- Point of Contact:

Download audio data: [https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz](https://huggingface.co/datasets/voidful/NMSQA/resolve/main/nmsqa_audio.tar.gz)

Unzip audio data: `tar -xf nmsqa_audio.tar.gz`

### Dataset Summary

The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset is designed for the task of textless spoken question answering. It is based on the SQuAD dataset and contains spoken questions and passages. The dataset includes the original text, transcriptions, and audio files of the spoken content. This dataset is created to evaluate the performance of models on textless spoken question answering tasks.

### Supported Tasks and Leaderboards

The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

Each instance in the dataset contains the following fields:

- id: Unique identifier for the instance
- title: The title of the passage
- context: The passage text
- question: The question text
- answers:
  - answer_start: The start index of the answer in the text
  - audio_full_answer_end: The end position of the audio answer in seconds
  - audio_full_answer_start: The start position of the audio answer in seconds
  - audio_full_neg_answer_end: The end position of the audio answer in seconds for an incorrect answer with the same words
  - audio_full_neg_answer_start: The start position of the audio answer in seconds for an incorrect answer with the same words
  - audio_segment_answer_end: The end position of the audio answer in seconds for the segment
  - audio_segment_answer_start: The start position of the audio answer in seconds for the segment
  - text: The answer text
- content_segment_audio_path: The audio path for the content segment
- content_full_audio_path: The complete audio path for the content
- content_audio_sampling_rate: The audio sampling rate
- content_audio_speaker: The audio speaker
- content_segment_text: The segment text of the content
- content_segment_normalized_text: The normalized text for generating audio
- question_audio_path: The audio path for the question
- question_audio_sampling_rate: The audio sampling rate
- question_audio_speaker: The audio speaker
- question_normalized_text: The normalized text for generating audio

### Data Fields

The dataset includes the following data fields:

- id
- title
- context
- question
- answers
- content_segment_audio_path
- content_full_audio_path
- content_audio_sampling_rate
- content_audio_speaker
- content_segment_text
- content_segment_normalized_text
- question_audio_path
- question_audio_sampling_rate
- question_audio_speaker
- question_normalized_text

### Data Splits

The dataset is split into train, dev, and test sets.

## Dataset Creation

### Curation Rationale

The NMSQA dataset is created to address the challenge of textless spoken question answering, where the model must answer questions based on spoken passages without relying on textual information.

### Source Data

The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.

#### Initial Data Collection and Normalization

The initial data collection involved converting the original SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and then audio files were generated using text-to-speech methods.

#### Who are the source language producers?

The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.

### Annotations

#### Annotation process

The annotations for the NMSQA dataset are derived from the original SQuAD dataset. Additional annotations, such as audio start and end positions for correct and incorrect answers, as well as audio file paths and speaker information, are added by the dataset creators.

#### Who are the annotators?

The annotators for the NMSQA dataset are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and additional annotations for the NMSQA dataset.

### Personal and Sensitive Information

The dataset does not contain any personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering tasks, which can lead to advancements in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in various domains, such as virtual assistants, customer service, and voice-controlled devices.

### Discussion of Biases

The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Additionally, biases may be introduced in the text-to-speech process and the choice of speakers used to generate the spoken audio files.

### Other Known Limitations

As the dataset is based on the SQuAD dataset, it shares the same limitations, including the fact that it is limited to the English language and mainly focuses on factual questions. Furthermore, the dataset may not cover a wide range of accents, dialects, or speaking styles.

## Additional Information

### Dataset Curators

The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, and Lin-Shan Lee.

### Licensing Information

The licensing information for the dataset is not explicitly mentioned.

### Citation Information

```bibtex
@article{lin2022dual,
  title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
  author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
  journal={arXiv preprint arXiv:2203.04911},
  year={2022}
}
```

### Contributions

Thanks to [@voidful](https://github.com/voidful) for adding this dataset.

Provider:

voidful

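Summary of Original Information

Dataset Card: NMSQA (Natural Multi-speaker Spoken Question Answering)

Dataset Description

Dataset Summary

The Natural Multi-speaker Spoken Question Answering (NMSQA) dataset is designed for textless spoken question answering. It is based on the SQuAD dataset and contains spoken questions and passages. The dataset includes the original text, transcriptions, and audio files of the spoken content. It is intended for evaluating model performance on textless spoken question answering tasks.

Supported Tasks and Leaderboards

The primary task supported by this dataset is textless spoken question answering, where the goal is to answer questions based on spoken passages without relying on textual information. The dataset can also be used for automatic speech recognition tasks.

Languages

The dataset is in English.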

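Dataset Structure

Data Instances

Each instance contains the following fields:

id: Unique identifier for the instance
title: The title of the passage
context: The passage text
question: The question text
answer_start: The start index of the answer in the text
audio_full_answer_end: The end position of the audio answer in seconds
audio_full_answer_start: The start position of the audio answer in seconds
audio_full_neg_answer_end: The end position of the audio answer in seconds for an incorrect answer with the same words
audio_full_neg_answer_start: The start position of the audio answer in seconds for an incorrect answer with the same words
audio_segment_answer_end: The end position of the audio answer in seconds for the segment
audio_segment_answer_start: The start position of the audio answer in seconds for the segment
text: The answer text
content_segment_audio_path: The audio path for the content segment
content_full_audio_path: The complete audio path for the content
content_audio_sampling_rate: The audio sampling rate of the content
content_audio_speaker: The speaker of the content audio
content_segment_text: The segment text of the content
content_segment_normalized_text: The normalized text used for generating the content audio
question_audio_path: The audio path for the question
question_audio_sampling_rate: The audio sampling rate of the question
question_audio_speaker: The speaker of the question audio
question_normalized_text: The normalized text used for generating the question audio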
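
Because the answer positions are expressed in seconds, cutting an answer out of a decoded waveform means converting them to sample indices with the sampling rate. The following is a minimal sketch, assuming the waveform is already loaded as a flat sequence of samples and the start and end values are plain floats. `answer_sample_span` and `slice_answer` are hypothetical helper names, not part of the dataset or of any library.

```python
from typing import Sequence, Tuple


def answer_sample_span(start_sec: float, end_sec: float, sampling_rate: int) -> Tuple[int, int]:
    """Convert an answer span in seconds to (start, end) sample indices."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    if end_sec < start_sec:
        raise ValueError("end_sec must not precede start_sec")
    # Round to the nearest sample; the end index is exclusive when slicing.
    start = int(round(start_sec * sampling_rate))
    end = int(round(end_sec * sampling_rate))
    return start, end


def slice_answer(samples: Sequence[float], start_sec: float, end_sec: float,
                 sampling_rate: int) -> Sequence[float]:
    """Return the samples that fall inside the answer span."""
    start, end = answer_sample_span(start_sec, end_sec, sampling_rate)
    return samples[start:end]
```

In practice the arguments would come from fields such as `audio_segment_answer_start`, `audio_segment_answer_end`, and `content_audio_sampling_rate`; whether those fields hold single values or lists per instance is not stated in the card and should be checked against the data.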

Data Fields

The dataset includes the following data fields:

id
title
context
question
answers
content_segment_audio_path
content_full_audio_path
content_audio_sampling_rate
content_audio_speaker
content_segment_text
content_segment_normalized_text
question_audio_path
question_audio_sampling_rate
question_audio_speaker
question_normalized_text
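
The field list above can be used as a lightweight sanity check on loaded records. This is a sketch only: the names in `NMSQA_FIELDS` are copied from the list, while `missing_fields` is a hypothetical helper and the record is assumed to be a plain dictionary keyed by top-level field name.

```python
from typing import Any, Dict, List

# Top-level field names copied from the "Data Fields" list.
NMSQA_FIELDS = (
    "id",
    "title",
    "context",
    "question",
    "answers",
    "content_segment_audio_path",
    "content_full_audio_path",
    "content_audio_sampling_rate",
    "content_audio_speaker",
    "content_segment_text",
    "content_segment_normalized_text",
    "question_audio_path",
    "question_audio_sampling_rate",
    "question_audio_speaker",
    "question_normalized_text",
)


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected top-level fields that are absent from `record`."""
    return [name for name in NMSQA_FIELDS if name not in record]
```

An empty result means every listed field is present; it says nothing about the types or contents of the values.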

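Data Splits

The dataset is split into train, dev, and test sets.

Dataset Creation

Curation Rationale

The NMSQA dataset was created to address the challenge of textless spoken question answering, where the model must answer questions based on spoken passages without relying on textual information.

Source Data

The NMSQA dataset is based on the SQuAD dataset, with spoken questions and passages created from the original text data.

Initial Data Collection and Normalization

The initial data collection involved converting the SQuAD dataset's text-based questions and passages into spoken audio files. The text was first normalized, and then audio files were generated using text-to-speech methods.

Source Language Producers

The source language producers are the creators of the SQuAD dataset and the researchers who generated the spoken audio files for the NMSQA dataset.

Annotations

Annotation Process

The annotations for the NMSQA dataset are derived from the original SQuAD dataset. Additional annotations, such as audio start and end positions for correct and incorrect answers, as well as audio file paths and speaker information, were added by the dataset creators.

Annotators

The annotators for the NMSQA dataset are the creators of the SQuAD dataset and the researchers who generated the spoken audio files and additional annotations for the NMSQA dataset.

Personal and Sensitive Information

The dataset does not contain any personal or sensitive information.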

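Considerations for Using the Data

Social Impact of Dataset

The NMSQA dataset contributes to the development and evaluation of models for textless spoken question answering, which can lead to advances in natural language processing and automatic speech recognition. Applications of these technologies can improve accessibility and convenience in domains such as virtual assistants, customer service, and voice-controlled devices.

Discussion of Biases

The dataset inherits potential biases from the original SQuAD dataset, which may include biases in the selection of passages, questions, and answers. Biases may also be introduced by the text-to-speech process and by the choice of speakers used to generate the spoken audio files.

Other Known Limitations

Because the dataset is based on SQuAD, it shares the same limitations: it is limited to English and mainly focuses on factual questions. The dataset may also not cover a wide range of accents, dialects, or speaking styles.

Additional Information

Dataset Curators

The NMSQA dataset is curated by Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-Wen Yang, Hsuan-Jui Chen, Shang-Wen Li, Abdelrahman Mohamed, Hung-Yi Lee, and Lin-Shan Lee.

Licensing Information

The licensing information for the dataset is not explicitly mentioned.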

Citation Information

@article{lin2022dual,
  title={DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning},
  author={Lin, Guan-Ting and Chuang, Yung-Sung and Chung, Ho-Lam and Yang, Shu-wen and Chen, Hsuan-Jui and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-yi and Lee, Lin-shan},
  journal={arXiv preprint arXiv:2203.04911},
  year={2022}
}

Contributions

Thanks to @voidful for adding this dataset.
