Wiki-En-ASR-Adapt

Name: Wiki-En-ASR-Adapt
Creator: Moscow Institute of Physics and Technology
Published: 2023-09-29 22:18:59
License: not specified

Source: arXiv; updated 2023-09-29; indexed 2024-06-21

Download link:

https://huggingface.co/datasets/bene-ges/wiki-en-asr-adapt

Description:

Wiki-En-ASR-Adapt is a large-scale synthetic dataset created by the Moscow Institute of Physics and Technology for contextual spell-checking customization of English automatic speech recognition (ASR). It contains 26 million realistic examples that simulate complex biasing lists for the customization task. The dataset draws on Wikipedia titles and text fragments, and produces "corrupted" versions of the biasing phrases by passing them through text-to-speech (TTS) and then ASR. During construction, the GIZA++ tool was used for character-level alignment to increase the diversity of the corruption inventory. The dataset can be used to train models of different architectures, in particular to improve the recognition of rare and out-of-vocabulary (OOV) phrases in ASR systems.
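The idea of a corruption inventory built from character-level alignments can be illustrated with a minimal sketch. It is not the dataset's actual pipeline: Python's standard `difflib.SequenceMatcher` stands in for GIZA++, the function name `corruption_inventory` is hypothetical, and the phrase pairs are invented for illustration rather than taken from the dataset.

```python
from collections import Counter
from difflib import SequenceMatcher


def corruption_inventory(pairs):
    """Count character-level edits between reference phrases and their
    corrupted (TTS+ASR-style) versions.

    Assumption: difflib is a simple stand-in for a real aligner such as GIZA++.
    """
    inventory = Counter()
    for reference, corrupted in pairs:
        matcher = SequenceMatcher(None, reference, corrupted, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Record which reference fragment turned into which corrupted fragment.
                inventory[(reference[i1:i2], corrupted[j1:j2])] += 1
    return inventory


# Hypothetical (reference, corrupted) pairs, invented for illustration only.
pairs = [
    ("kowalski", "kovalski"),
    ("nowak", "novak"),
]
inventory = corruption_inventory(pairs)
print(inventory.most_common())
```

An inventory of this kind could then be sampled to corrupt further phrases, which is one way to diversify the misspellings a customization model sees during training.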

Provider: Moscow Institute of Physics and Technology
Created: 2023-09-29
