Multilingual test set for language identification and speech recognition from European Parliament recordings
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12784313
Description:
This test set for language identification and speech recognition is composed of multilingual extracts from recordings of European Parliament sessions.
Dataset description
Audio files and official transcripts were downloaded from: https://www.europarl.europa.eu/plenary/en/debates-video.html
The test set has a total duration of 02h 56m 34s and consists of 15 multilingual audio files of around 12 minutes each, selected from the original material to maximize the number of language changes.
Official language labels were manually reviewed to fix start/end timestamps, and official text transcripts, where present, were added to the annotation.
The test set covers 19 languages in total.
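As a quick consistency check on the figures above, the stated total duration and file count can be turned into an average file length. The following is only a minimal sketch: the helper name `parse_duration` is hypothetical and is not part of the dataset or its README, and the inputs are just the two numbers quoted in this description.

```python
def parse_duration(text: str) -> int:
    """Convert a duration written like '02h 56m 34s' into seconds."""
    # Each whitespace-separated part ends in a unit letter (h, m, s).
    hours, minutes, seconds = (int(part[:-1]) for part in text.split())
    return hours * 3600 + minutes * 60 + seconds


# Values taken from the dataset description.
total_seconds = parse_duration("02h 56m 34s")
num_files = 15

mean_seconds = total_seconds / num_files
print(f"{total_seconds} s total, about {mean_seconds / 60:.1f} min per file")
# prints: 10594 s total, about 11.8 min per file
```

The result of roughly 11.8 minutes per file agrees with the "around 12 minutes" stated in the description.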
The test set is presented in the following paper:
M. Valente, F. Brugnara, G. Morrone, E. Zovato, L. Badino, "Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech", accepted to Interspeech 2024.
For more information please refer to the README.txt in the testset .zip archive.
License and copyright
The data is released under the CC0 license: https://creativecommons.org/public-domain/cc0/
For the raw data, see also the European Parliament's legal notice: https://www.europarl.europa.eu/legal-notice/en/
Created:
2024-07-19



