Libriheavy
Source: arXiv | Updated: 2024-01-15 | Indexed: 2024-06-21
Download link:
https://github.com/k2-fsa/libriheavy
Description:
Libriheavy is a large-scale automatic speech recognition (ASR) corpus of 50,000 hours of read English speech derived from the LibriVox project. Beyond normalized transcriptions, it retains punctuation, casing, and the surrounding text context, which makes it suitable for building flexible ASR systems. The corpus is divided into three training subsets (small, medium, large) and three evaluation subsets (dev, test-clean, test-other) for validation and testing. To build it, the authors proposed a general-purpose audio alignment method and released it as an open-source package, which simplifies the construction of other ASR corpora. Libriheavy supports a range of speech recognition tasks and is intended to address the lack of formatted text and contextual information in existing datasets.
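To make the distinction between a normalized transcription and one that keeps punctuation and casing concrete, here is a minimal Python sketch. The function name `normalize_transcript` and the rule it applies (drop punctuation except apostrophes, uppercase, collapse whitespace) are assumptions chosen for illustration; they are not taken from the Libriheavy pipeline, which may normalize text differently.

```python
import string


def normalize_transcript(text: str) -> str:
    """Hypothetical normalization rule, for illustration only.

    Drops punctuation (keeping apostrophes), uppercases the text,
    and collapses runs of whitespace into single spaces.
    """
    # Keep apostrophes so contractions such as "Don't" stay intact.
    drop = string.punctuation.replace("'", "")
    # Map every dropped character to a space so adjacent words do not merge.
    cleaned = text.translate(str.maketrans(drop, " " * len(drop)))
    return " ".join(cleaned.upper().split())


# A cased, punctuated sentence of the kind the corpus description mentions.
original = 'He said, "Don\'t go yet."'
print(normalize_transcript(original))
```

A system trained on the normalized form cannot recover the quotation marks or sentence boundaries seen in `original`, which is the gap that transcripts with punctuation and casing are meant to close.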
Provided by:
Xiaomi Corporation
Created:
2023-09-15



