espnet/yodas_owsmv4
Source: Hugging Face. Updated: 2025-09-01. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/espnet/yodas_owsmv4
Description:
This dataset comprises 166,000 hours of multilingual speech spanning 75 languages, segmented into 30-second long-form audio clips. The data is sourced from YODAS2, a dataset built from large-scale web-crawled content. Because of its web origin, the original YODAS2 data may contain inaccurate language labels and misaligned audio-text pairs. To address this, a scalable data-cleaning pipeline was developed using publicly available toolkits, yielding a curated subset of the original dataset. The cleaned dataset serves as the core training data for the OWSM v4 models, which, when trained on it together with existing OWSM data, significantly outperform previous OWSM versions on multilingual ASR benchmarks.
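The description names two defects in the raw data (wrong language labels and misaligned audio-text pairs) but does not spell out how the pipeline removes them. As a rough illustration only, the sketch below filters clips on both criteria. Everything in it is an assumption made for the example: the `Clip` fields, the `clean` function, and the 0.5 threshold are hypothetical and are not taken from the OWSM v4 pipeline or from the dataset's schema.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    labeled_lang: str    # language label that came with the web data
    predicted_lang: str  # hypothetical: output of some language-ID model
    align_score: float   # hypothetical: audio-text alignment confidence in [0, 1]


def clean(clips, min_align_score=0.5):
    """Keep clips whose label agrees with the predicted language
    and whose alignment score reaches the (assumed) threshold."""
    return [
        c for c in clips
        if c.labeled_lang == c.predicted_lang and c.align_score >= min_align_score
    ]


clips = [
    Clip("a", "en", "en", 0.92),  # consistent label, well aligned: kept
    Clip("b", "en", "de", 0.95),  # label disagrees with prediction: dropped
    Clip("c", "ja", "ja", 0.10),  # poorly aligned transcript: dropped
]
kept = clean(clips)
print([c.clip_id for c in kept])
```

In a real pipeline the two scores would come from separate models run over every clip, and the threshold would likely be tuned per language rather than fixed.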
Provided by:
espnet



