Multilingual test set for language identification and speech recognition from European Parliament recordings

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/12784313

Description:

This test set for language identification and speech recognition is composed of multilingual extracts from recordings of European Parliament sessions.

Dataset description

Audio files and official transcripts were downloaded from: https://www.europarl.europa.eu/plenary/en/debates-video.html

The test set has a total duration of 02h 56m 34s and consists of 15 multilingual audio files of around 12 minutes each, selected from the original material to maximize the number of language changes. Official language labels were manually reviewed to fix start/end timestamps, and official text transcripts, where present, were added to the annotation. The test set covers 19 languages in total.

The test set is presented in the following paper: M. Valente, F. Brugnara, G. Morrone, E. Zovato, L. Badino, "Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech", accepted to Interspeech 2024.

For more information, please refer to the README.txt in the test set .zip archive.

License and copyright

The data is released under the CC0 license: https://creativecommons.org/public-domain/cc0/

For the raw data, see also the European Parliament's legal notice: https://www.europarl.europa.eu/legal-notice/en/
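As a quick consistency check on the figures in the dataset description, the stated total duration can be divided by the stated number of files to see what "around 12 minutes" means in practice. This is only an arithmetic sketch using the two numbers quoted in the description; the variable names are illustrative and nothing here reads the actual archive.

```python
# Sanity check on the durations quoted in the dataset description.
# Inputs are the stated figures only; no audio files are inspected.
total_seconds = 2 * 3600 + 56 * 60 + 34   # 02h 56m 34s
num_files = 15                            # stated number of audio files

mean_seconds = total_seconds / num_files
minutes, seconds = divmod(round(mean_seconds), 60)

print(f"total: {total_seconds} s")
print(f"mean per file: {minutes} min {seconds} s")
```

The mean comes out at 11 min 46 s per file, which agrees with the description's "around 12 minutes".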

Created:

2024-07-19
