
NISV 81k Dutch TV Speech Data Set

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14883497
Description:
This dataset was developed as part of a Dutch HOSAN research program exploring the feasibility of utilizing heritage datasets from the Netherlands to create speech models that represent all Dutch voices.

The dataset contains a large quantity of Dutch audio data from Dutch television broadcasts in the period 1972-2022, stored at the Netherlands Institute for Sound & Vision. The audio files add up to a total of 81k hours of audio, with most audio files having a length of 30 minutes to 1 hour.

An initial selection was made of material from the period 1972-2022 that met the following criteria:
- TV broadcasts excluding international news
- Radio broadcasts from the radio station NPO Radio 1
- Excluding music-related genres
- Broadcast programme material only (no rushes etc.)
- Programme duration available in the metadata
- Digital carrier available

This initial selection contained approximately 184k hours of TV and 128k hours of radio. For training speech models, only the TV data was selected. The set was further reduced by selecting specific genres (see genres.txt file), and by removing audio with a length longer than three hours. Only a single broadcast per day of any given series (e.g. one single edition of the Dutch public broadcaster's news programme per day) was selected, as it was a requirement for training the speech models that the set contained as little duplication of audio fragments as possible.

Low-resolution versions of the MXF carriers were downloaded, the audio (in AAC format) extracted and this dataset delivered to the researchers under secure conditions with strict non-disclosure agreements in place regarding both the data and the resulting models.

Initial use of the data revealed that eighty-eight audio files contained a virtually flat audio signal. Investigation of a sample at Sound & Vision revealed that these came from videos for which the original analogue carriers contained no audio signal. The carrier IDs of these files are contained in the file 'no_audio.txt'.

This published version of the dataset contains the following files:
- filtered_any_genre_cc0.zip
- filtered_any_genre_cc0.csv - A dataframe containing the IDs of the programmes and their digital carriers, and non-copyrighted metadata about the programme such as title and broadcast date.
- segments.txt - The timecodes of the sections of the carriers used in training the speech models
- genres.txt - a list of the genres selected (in Dutch)
- no_audio.txt - a list of the carriers without significant audio content

The audio files themselves are under copyright. The published dataset serves as a reference standard for detailing any research conducted using it.
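The reduction steps described above (genre selection, the three-hour limit, and one broadcast per series per day) can be illustrated with a minimal Python sketch. This is not the procedure actually used at Sound & Vision. The field names (carrier_id, series, broadcast_date, genre, duration_s), the sample rows, and the function select_broadcasts are hypothetical and do not reflect the real columns of filtered_any_genre_cc0.csv. Keeping the first broadcast seen on a given day and skipping the carriers listed in no_audio.txt are also assumptions made for illustration.

```python
MAX_DURATION_S = 3 * 60 * 60  # three-hour limit mentioned in the description


def select_broadcasts(rows, allowed_genres, no_audio_ids=frozenset()):
    """Keep at most one broadcast per (series, day).

    Rows outside the allowed genres, longer than three hours, or listed as
    having no audio are skipped. Which broadcast is kept per day is an
    assumption: here it is simply the first one encountered.
    """
    seen = set()
    selected = []
    for row in rows:
        if row["genre"] not in allowed_genres:
            continue
        if row["duration_s"] > MAX_DURATION_S:
            continue
        if row["carrier_id"] in no_audio_ids:
            continue
        key = (row["series"], row["broadcast_date"])
        if key in seen:
            continue  # a broadcast of this series on this day was already kept
        seen.add(key)
        selected.append(row)
    return selected


# Hypothetical sample rows, invented for illustration only.
rows = [
    {"carrier_id": "c1", "series": "Journaal", "broadcast_date": "1999-01-01",
     "genre": "nieuws", "duration_s": 1800},
    {"carrier_id": "c2", "series": "Journaal", "broadcast_date": "1999-01-01",
     "genre": "nieuws", "duration_s": 1800},   # same series, same day
    {"carrier_id": "c3", "series": "Journaal", "broadcast_date": "1999-01-02",
     "genre": "nieuws", "duration_s": 1800},
    {"carrier_id": "c4", "series": "Marathon", "broadcast_date": "1999-01-01",
     "genre": "nieuws", "duration_s": 4 * 3600},  # longer than three hours
    {"carrier_id": "c5", "series": "Concert", "broadcast_date": "1999-01-01",
     "genre": "muziek", "duration_s": 1800},   # genre not selected
    {"carrier_id": "c6", "series": "Docu", "broadcast_date": "1999-01-01",
     "genre": "documentaire", "duration_s": 1800},  # listed as no audio
]

selected = select_broadcasts(
    rows,
    allowed_genres={"nieuws", "documentaire"},
    no_audio_ids={"c6"},
)
print([r["carrier_id"] for r in selected])  # ['c1', 'c3']
```

With real data, the genre list would come from genres.txt and the excluded carrier IDs from no_audio.txt, after adapting the field names to the actual CSV header.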
Created:
2025-02-20