Spoken-SQuAD

Name: Spoken-SQuAD
Creator: SQuAD benchmark dataset
License: No description available

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/chiahsuan156/spoken-squad

Overview:

This dataset is a large-scale listening comprehension corpus. Questions are given as text, while the source articles are converted into speech with Google Text-to-Speech (Google TTS). The word error rate (WER) is approximately 22.77% on the training set and 22.73% on the test set. The dataset comprises roughly 37,000 training pairs and 5,400 test pairs, and its task is spoken question answering.
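The WER figures quoted here follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between a reference text and a recognized transcript, divided by the number of reference words. The sketch below illustrates that definition only. The function name `word_error_rate` and the sample sentences are made up for illustration, and the exact text normalization used to obtain the reported numbers is not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

A WER near 22.7% therefore means roughly one word error for every four to five reference words.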

Provider:

SQuAD benchmark dataset

Dataset introduction

Background overview

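Spoken-SQuAD is the first large-scale spoken question answering dataset built on SQuAD. It was generated by converting the text articles into speech and then adding automatic speech recognition (ASR) transcriptions, and it is intended for studying how speech recognition errors affect listening comprehension. The dataset contains 37,111 training pairs and 5,351 test pairs, with a baseline word error rate (WER) of about 22.73%, and it provides two noise-added versions of the test set to simulate different audio quality conditions.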

The content above was collected and summarized by 遇见数据集.
