Resource overview:
Dataset copied from http://hdl.handle.net/20.500.12537/191 by Reykjavik University.
Information can be found at that link.
RUV TV unknown speakers
About the RUV TV unknown speakers corpus
---------------------------
The RUV TV unknown speakers corpus is 281 hours of TV data from six RÚV TV
shows. The data contains 221,759 utterances from various unlabelled speakers.
The text is normalized. The data is aligned and segmented, ready for ASR
training. Audio conditions vary between recordings. This data set is published
by the Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV) and made
by both RÚV and Reykjavik University. This work is licensed under the Creative
Commons Attribution 4.0 International License.
This is a broadcast dataset collected from RÚV by Reykjavík University in
2019-2020. All episodes in this dataset aired in 2019 at the latest. All
episodes were recorded as digital originals. The text originates from RÚV
subtitles (.vtt) and teletext (888). Audio files are 16 kHz single-channel FLAC
files created from the original .mp4 episodes. The alignment was done using the Kaldi
Speech Recognition Toolkit (https://github.com/kaldi-asr/kaldi) and the scripts
from our alignment repository
(https://github.com/cadia-lvl/alignment-and-segmentation). This dataset was
released in February 2022 (2022-02).
The dataset contains data from the following 6 shows:
Fréttir kl. 19:00 - prime time news
Kastljós - news commentary
Kiljan - literature discussion
Krakkafréttir - news for children
Menningin - arts and culture show
Stundin Okkar - children's variety show
This dataset complements the RÚV TV data cited below. The two datasets have no
overlapping episodes:
Helgadottir, Inga Run; Fong, Judy Yum; Gudnason, Jon; et al., 2020, RÚV TV
data, CLARIN-IS, http://hdl.handle.net/20.500.12537/93.
The structure of the corpus
---------------------------
<corpus root>
|
. - docs/
|   |
|   . - README.txt
|
. - data/
    |
    . - metadata.tsv
    |
    . - text
    |
    . - audio/
        |
        . - Frettirkl1900/
        |   |
        |   . - 4942689/
        |       |
        |       . - 4942689-00000.flac
        |       |
        |       . - ...
        |
        . - Kastljos/
        |
        . - Kiljan/
        |
        . - Krakkafrettir/
        |
        . - Menningin/
        |
        . - StundinOkkar/
|
. - filename.filetype
- metadata.tsv - This is a tab-separated file containing utterance_id,
episode_id, show_id, and duration (seconds). The path of an audio file can be
constructed from the show_id, episode_id, and utterance_id
(data/audio/show_id/episode_id/utterance_id.flac). Within each show, the episode
numbers are ordered chronologically, meaning episode 4813755 of Kiljan aired
before 4813757.
- text - This is a text file in the format used by Kaldi's data directories. Each
line contains the utterance_id followed by the text spoken within the utterance.
Unrecognized words are represented with UNK.
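
The file descriptions above can be sketched in Python as follows. This is a
minimal illustration, not part of the corpus tooling: the function names are
hypothetical, the column order of metadata.tsv is assumed to match the order in
which the fields are listed above, the file is assumed to have no header row,
and the sample duration and transcript are made up.

```python
from pathlib import PurePosixPath


def parse_metadata_line(line):
    """Split one metadata.tsv row (assumed order: utterance_id,
    episode_id, show_id, duration in seconds; assumed no header row)."""
    utterance_id, episode_id, show_id, duration = line.rstrip("\n").split("\t")
    return {
        "utterance_id": utterance_id,
        "episode_id": episode_id,
        "show_id": show_id,
        "duration": float(duration),
    }


def audio_path(show_id, episode_id, utterance_id, root="data/audio"):
    """Build data/audio/show_id/episode_id/utterance_id.flac."""
    return str(PurePosixPath(root) / show_id / episode_id / f"{utterance_id}.flac")


def parse_text_line(line):
    """Split one line of the Kaldi-style text file into (utterance_id, transcript)."""
    parts = line.strip().split(maxsplit=1)
    transcript = parts[1] if len(parts) > 1 else ""
    return parts[0], transcript


# Hypothetical sample rows; the IDs follow the directory tree above.
row = parse_metadata_line("4942689-00000\t4942689\tFrettirkl1900\t3.5\n")
path = audio_path(row["show_id"], row["episode_id"], row["utterance_id"])
utt_id, transcript = parse_text_line("4942689-00000 góða kvöldið UNK\n")
print(path)
print(utt_id, "->", transcript)
```

Splitting the text file on the first run of whitespace only keeps the
transcript intact, including any UNK tokens.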
Statistics
----------
6 TV shows
281 hrs
221766 utterances
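
As a rough sanity check on these figures, the average utterance length they
imply can be computed directly. This is only a sketch: the function name is
hypothetical, and because the hour total is rounded, the result is approximate.

```python
def mean_utterance_seconds(total_hours, num_utterances):
    """Average utterance duration in seconds implied by the corpus totals."""
    return total_hours * 3600 / num_utterances


# Totals taken from the statistics above; the hour count is rounded.
mean_seconds = mean_utterance_seconds(281, 221766)
print(f"{mean_seconds:.2f} seconds per utterance on average")
```

With the stated totals this comes to roughly 4.6 seconds per utterance. Summing
the duration column of metadata.tsv would give the exact figure.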
Authors
-------
Reykjavík University
Judy Y Fong - judy@judyyfong.xyz
Inga Run Helgadottir
Helga Svala Sigurðardóttir
Michal Borsky
Ragnheiður Þórhallsdóttir
Jon Gudnason - jg@ru.is
The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
Helga Lara Thorsteinsdottir
Acknowledgements
----------------
This project was funded by the Language Technology Programme for Icelandic
2019-2023. The programme, which is managed and coordinated by Almannarómur, is
funded by the Icelandic Ministry of Education, Science and Culture.
License
-------
This dataset is licensed under Creative Commons - Attribution 4.0 International
(CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
---
dataset_info:
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: show_name
    dtype: string
  - name: episode_id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 30819626505.488
    num_examples: 221766
  download_size: 23666124875
  dataset_size: 30819626505.488
---
# Dataset Card for "ruv_tv_unknown_speakers"