ASR Bundestag

Source: arXiv. Updated: 2023-02-13. Indexed: 2024-06-21.
Download link:
https://opendata.iisys.de/
Description:
ASR Bundestag is a large German political debate dataset created by Hof University of Applied Sciences. It contains 610 hours of audio-text pairs intended for supervised training of German automatic speech recognition (ASR) models. The dataset is built from the original audio recordings and transcripts of plenary sessions and committee meetings of the German Bundestag, so it contains a large amount of spontaneous speech along with political terminology and phrases, with the aim of improving the performance of German ASR models. Creating the dataset involved collecting, processing, and pairing the audio and transcripts, and two different processing approaches were used to ensure high-quality audio-text alignment. The dataset is meant primarily to address data scarcity in German ASR, particularly for domain-specific vocabulary and spontaneous speech.
Provider:
Hof University of Applied Sciences
Created:
2023-02-13