espnet/yodas_owsmv4

Source: Hugging Face. Updated 2025-09-01. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/espnet/yodas_owsmv4
Description:
This dataset comprises 166,000 hours of multilingual speech spanning 75 languages, segmented into 30-second long-form audio clips. The data is sourced from YODAS2, a dataset built from large-scale web-crawled content. Because of the nature of web-sourced data, the original YODAS2 dataset may contain inaccurate language labels and misaligned audio-text pairs. To address this, a scalable data-cleaning pipeline was developed using publicly available toolkits, yielding a curated subset of the original dataset. The cleaned dataset serves as core training data for the OWSM v4 models, which, when combined with existing OWSM data, significantly outperform previous versions on multilingual ASR benchmarks.
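
For a rough sense of scale, the figures in the description can be turned into an approximate clip count. The following is a back-of-the-envelope sketch, not a statistic published with the dataset: it assumes every clip is exactly 30 seconds long, and the function names are hypothetical. The per-language average is only a uniform-split illustration, since the description does not say how the hours are distributed across languages.

```python
# Back-of-the-envelope scale estimate from the figures in the description.
# Assumption: every clip is exactly 30 seconds long (the real data may differ).

TOTAL_HOURS = 166_000   # total speech, from the description
CLIP_SECONDS = 30       # clip length, from the description
NUM_LANGUAGES = 75      # language count, from the description


def estimate_clip_count(total_hours: int, clip_seconds: int) -> int:
    """Number of fixed-length clips needed to cover total_hours of audio."""
    return total_hours * 3600 // clip_seconds


def mean_hours_per_language(total_hours: int, num_languages: int) -> float:
    """Average hours per language under a (hypothetical) uniform split."""
    return total_hours / num_languages


if __name__ == "__main__":
    clips = estimate_clip_count(TOTAL_HOURS, CLIP_SECONDS)
    avg = mean_hours_per_language(TOTAL_HOURS, NUM_LANGUAGES)
    print(f"Approximate clips: {clips:,}")
    print(f"Uniform-split hours per language: {avg:,.1f}")
```

Under these assumptions the corpus corresponds to roughly 19.9 million clips. Web-crawled corpora are typically skewed toward a few high-resource languages, so the per-language figure should be read only as an order of magnitude.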
Provider:
espnet