Multilingual test set for language identification and speech recognition from European Parliament recordings

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12784313
Description:
This test set for language identification and speech recognition consists of multilingual extracts from recordings of European Parliament sessions.

Dataset description
Audio files and official transcripts were downloaded from: https://www.europarl.europa.eu/plenary/en/debates-video.html
The test set has a total duration of 02h 56m 34s and comprises 15 multilingual audio files of around 12 minutes each, selected from the original material to maximize the number of language changes. Official language labels were manually reviewed to fix start/end timestamps, and official text transcripts, where present, were added to the annotation. The test set covers 19 languages in total.

The test set is presented in the following paper: M. Valente, F. Brugnara, G. Morrone, E. Zovato, L. Badino, "Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech", accepted to Interspeech 2024.
For more information, please refer to the README.txt in the testset .zip archive.

License and copyright
The data is released under the CC0 license: https://creativecommons.org/public-domain/cc0/
For the raw data, see also the European Parliament's legal notice: https://www.europarl.europa.eu/legal-notice/en/
Created:
2024-07-19