Combined Speech Dataset
Source: arXiv. Indexed: 2025-09-30
Download link:
https://www.wavlab.org/activities/2024/xeus/
Description:
This dataset is a comprehensive collection of 1.081 million hours of speech data covering 4,057 languages. It combines existing publicly available speech corpora with newly created ones. The dataset spans 189 language families and includes both high-resource and low-resource languages, some of which have as little as 1 hour of speech data. The data is intended for pre-training, with speech representation learning as the target task.
Provided by:
Multiple sources, including Global Recordings Network, WikiTongues, and Inspirational Films



