
Speech recognition alignments for Finnish parliament data

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4581940
Description:
This dataset contains speech from the plenary sessions of the Finnish parliament, 2008-2020, segmented and aligned for speech recognition training. In total, the training set has:

- 1.4 million samples
- 3100 hours of audio
- 460 speakers
- over 19 million word tokens

Additionally, the upload contains a 5-hour development set and a 5-hour evaluation set, described in publication 10.21437/Interspeech.2017-1115.

Due to the size of the training set (~300 GB) and the Zenodo upload limit (50 GB), only the development and evaluation sets are published on Zenodo. The rest of the data is available at: http://urn.fi/urn:nbn:fi:lb-2021051903

The training set comes in two parts:

- The 2008-2016 set, originally described in publication 10.21437/Interspeech.2017-1115. This set includes a list of samples from sessions in 2008-2014 that can be combined with the 2015-2020 set to form the 3100-hour training set.
- A new 2015-2020 set.

All audio samples are single-channel, 16 kHz, 16-bit wav files. Each wav file has a corresponding transcript in a .trn text file. The data is machine-extracted, so small inaccuracies remain in the training set transcripts, and the set may contain a few Swedish samples. The development and evaluation sets have been corrected by hand.

The licenses can be viewed at:

- http://urn.fi/urn:nbn:fi:lb-2019112822 (audio)
- http://urn.fi/urn:nbn:fi:lb-2019112823 (text)

The code used in extraction is available at:

- https://github.com/aalto-speech/finnish-parliament-scripts (2008-2014, dev and eval sets)
- https://github.com/aalto-speech/fi-parliament-tools (2015-2020 set)
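
The audio format stated in the description (single-channel, 16 kHz, 16-bit wav with a matching .trn transcript) can be checked per sample before training. The following is a minimal sketch using only the Python standard library. The function name `check_sample` is hypothetical, and two details are assumptions not confirmed by the description: that the .trn file sits next to the wav file with the same base name, and that transcripts are UTF-8 encoded.

```python
import wave
from pathlib import Path


def check_sample(wav_path):
    """Validate one wav file and return (duration_in_seconds, transcript).

    Assumption: the transcript is stored next to the audio file with the
    same base name and a .trn extension, encoded as UTF-8.
    """
    wav_path = Path(wav_path)
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"{wav_path}: expected single-channel audio")
        if wav.getframerate() != 16000:
            raise ValueError(f"{wav_path}: expected a 16 kHz sample rate")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            raise ValueError(f"{wav_path}: expected 16-bit samples")
        duration = wav.getnframes() / wav.getframerate()

    trn_path = wav_path.with_suffix(".trn")
    transcript = trn_path.read_text(encoding="utf-8").strip()
    return duration, transcript
```

Calling `check_sample` on each wav file in a downloaded set and summing the returned durations gives a quick comparison against the stated set sizes, for example roughly 5 hours for the development set.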
Created:
2021-05-24