CU MultiLang Dataset
Source: arXiv | Updated: 2023-08-29 | Indexed: 2024-06-21
Download link:
https://www.speechdata.com/datasets/cu_multilang
Description:
CU MultiLang Dataset is a large-scale multilingual speech dataset created by Columbia University. It covers 51 languages, with up to 10 hours of speech per language and corresponding text transcriptions. The dataset was built by combining multiple open-source datasets, with the aim of covering a broad range of language families and increasing speaker diversity. Its languages are split into 32 in-set languages and 19 out-of-set languages, which are used to train and evaluate open-set spoken language identification systems. The dataset targets open-set spoken language identification, addressing the problem that existing systems cannot recognize languages they have not seen.
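To illustrate what the in-set / out-of-set split is for, here is a minimal Python sketch of one common open-set decision rule: classify over the in-set languages and fall back to "unknown" when the top probability is low. This is an assumption for illustration only, not the dataset authors' method; the label list, the `predict_open_set` function, and the threshold value are all hypothetical.

```python
import math

# Hypothetical stand-ins for the 32 in-set languages (assumption, not from the dataset).
IN_SET = ["en", "es", "zh"]
UNKNOWN = "unknown"


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, labels=IN_SET, threshold=0.7):
    """Return the most likely in-set label, or UNKNOWN if confidence is below threshold.

    The threshold of 0.7 is an arbitrary illustrative value.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else UNKNOWN


if __name__ == "__main__":
    print(predict_open_set([5.0, 0.0, 0.0]))  # confident score for the first label
    print(predict_open_set([1.0, 1.0, 1.0]))  # flat scores, treated as out-of-set
```

In a setup like this, the out-of-set languages would be held out of the label list and used only to check how often the system correctly rejects an utterance instead of forcing it into an in-set class.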
Provider:
Columbia University
Created:
2023-08-29



