The Norwegian Parliamentary Speech Corpus (NPSC)
Source: arXiv. Updated: 2022-01-26. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2201.10881v1
Description:
The Norwegian Parliamentary Speech Corpus (NPSC) is an open speech dataset developed by the National Library of Norway for acoustic modeling of unscripted Norwegian speech. It contains recordings of Norwegian parliamentary sessions from 2017 and 2018, totaling approximately 140 hours and about 1.2 million words. The recordings are manually transcribed and annotated with language codes and speaker labels, and come with detailed speaker metadata. The transcriptions follow the two official written standards of Norwegian, Bokmål and Nynorsk; non-standardized words are explicitly marked and annotated with their standardized equivalents. The corpus is intended mainly for training and testing automatic speech recognition (ASR) systems, particularly on dialectal and non-standard speech, with the aim of improving ASR performance in real-world applications.
Provider:
National Library of Norway
Created:
2022-01-26
Collected summary
Dataset introduction

Background and challenges
Background overview
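The NPSC is an open speech dataset developed by the National Library of Norway. It contains about 140 hours of recordings from Norwegian parliamentary sessions and 1.2 million words, manually transcribed and annotated, with transcriptions in both Bokmål and Nynorsk. It is used mainly for training and testing automatic speech recognition (ASR) systems, with a particular focus on dialectal and non-standard speech, to improve ASR performance in real-world settings.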
The content above was collected and summarized by 遇见数据集.



