Broadcast News Corpus
Source: arXiv | Updated: 2014-12-15 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/1412.4616v1
Description:
Broadcast News Corpus is a collection of more than 160 hours of German broadcast news, created by the Institute of Human-Machine Communication at the Technical University of Munich, Germany. Most of the material comes from radio broadcasts, with a smaller portion from television news. The audio is sampled at 16 kHz and encoded as 16-bit PCM. The recordings were manually segmented and annotated, and the annotations are stored in XML, which records the segmentation of each audio file, the transcripts, and changes in background conditions. The corpus is intended as a resource for evaluating and tuning German large-vocabulary continuous speech recognition (LVCSR) systems, and it is particularly suited to challenges of German broadcast news transcription, such as morphological inflection and compound words.
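The audio parameters stated in the description (16 kHz sampling rate, 16-bit PCM) can be checked before a file is passed to a recognizer. The following is a minimal sketch, not part of the corpus tooling: it assumes the audio is available as WAV files, which the listing does not state, and the names `matches_corpus_format` and `make_silent_wav` are hypothetical helpers introduced only for illustration.

```python
import io
import wave

# Values taken from the corpus description.
EXPECTED_RATE_HZ = 16_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM = 2 bytes per sample


def matches_corpus_format(wav_file) -> bool:
    """Return True if a WAV file (path or file-like object) is 16 kHz, 16-bit.

    Assumption: the audio is stored in a WAV container.
    """
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def make_silent_wav(rate_hz: int, sample_width: int, seconds: int = 1) -> io.BytesIO:
    """Build an in-memory mono WAV of silence, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (rate_hz * sample_width * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(matches_corpus_format(make_silent_wav(16_000, 2)))
    print(matches_corpus_format(make_silent_wav(8_000, 2)))
```

The check deliberately ignores the channel count, since the description does not say whether the recordings are mono or stereo.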
Provider:
Institute of Human-Machine Communication
Created:
2014-12-15



