
tiro-is/ruv_tv_unknown_speakers

Source: Hugging Face. Updated 2023-09-08. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tiro-is/ruv_tv_unknown_speakers
Resource description:
Dataset copied from http://hdl.handle.net/20.500.12537/191 by Reykjavik University. Information can be found at that link.

RUV TV unknown speakers

About the RUV TV unknown speakers corpus
---------------------------

The RUV TV unknown speakers corpus is 281 hours of TV data from six RÚV TV shows. The data contains 221,759 utterances from various unlabelled speakers. The text is normalized. The data is aligned and segmented, ready for ASR training. Audio conditions vary between recordings.

This data set is published by the Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV) and made by both RÚV and Reykjavik University. This work is licensed under the Creative Commons Attribution 4.0 International License.

This is a broadcast dataset collected from RÚV by Reykjavík University in 2019-2020. All episodes within this dataset aired in 2019 at the latest. All episodes were recorded as digital originals. The text originates from RÚV subtitles (.vtt) and teletext (888). Audio files are 16 kHz single-channel FLAC created from the original .mp4 episodes. The alignment was done using the Kaldi Speech Recognition Toolkit (https://github.com/kaldi-asr/kaldi) and the scripts from our alignment repository (https://github.com/cadia-lvl/alignment-and-segmentation). This dataset was released in February 2022 (2022-02).

The dataset contains data from the following 6 shows:

Fréttir kl. 19:00 - prime time news
Kastljós - news commentary
Kiljan - literature discussion
Krakkafréttir - news for children
Menningin - arts and culture show
Stundin Okkar - children's variety show

This dataset complements the RÚV TV data. There are no overlapping episodes:
Helgadottir, Inga Run; Fong, Judy Yum; Gudnason, Jon; et al., 2020, RÚV TV data, CLARIN-IS, http://hdl.handle.net/20.500.12537/93.

The structure of the corpus
---------------------------

<corpus root>
|
. - docs/
|   . - README.txt
|
. - data/
    . - metadata.tsv
    . - text
    . - audio/
        . - Frettirkl1900/
        |   . - 4942689/
        |       . - 4942689-00000.flac
        |       . - ...
        . - Kastljos/
        . - Kiljan/
        . - Krakkafrettir/
        . - Menningin/
        . - StundinOkkar/
        . - filename.filetype

- metadata.tsv - This is a tab-separated file containing utterance_id, episode_id, show_id, and duration (seconds). The path of an audio file can be constructed from the show_id, episode_id, and utterance_id (data/audio/show_id/episode_id/utterance_id.flac). Within each show, the episode numbers are sequential, meaning episode 4813755 of Kiljan aired before 4813757.
- text - This is a text file in the format required for Kaldi's data directories. It contains the utterance_id followed by the text spoken within the utterance. Unrecognized words are represented with UNK.

Statistics
----------

6 TV shows
281 hrs
221766 utterances

Authors
-------

Reykjavík University
Judy Y Fong - judy@judyyfong.xyz
Inga Run Helgadottir
Helga Svala Sigurðardóttir
Michal Borsky
Ragnheiður Þórhallsdóttir
Jon Gudnason - jg@ru.is

The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
Helga Lara Thorsteinsdottir

Acknowledgements
----------------

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

License
-------

This dataset is licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

---
dataset_info:
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: show_name
    dtype: string
  - name: episode_id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 30819626505.488
    num_examples: 221766
  download_size: 23666124875
  dataset_size: 30819626505.488
---

# Dataset Card for "ruv_tv_unknown_speakers"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provider:
tiro-is
Summary of the original information

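Dataset overview

Dataset name

RUV TV unknown speakers

Dataset description

The RUV TV unknown speakers corpus contains 281 hours of TV data from six RÚV TV shows, with 221,759 utterances from various unlabelled speakers. The text is normalized, and the data is aligned and segmented, ready for automatic speech recognition (ASR) training. Audio conditions vary between recordings. The dataset is published by the Icelandic National Broadcasting Service, Ríkisútvarpið (RÚV), was made by both RÚV and Reykjavik University, and is licensed under the Creative Commons Attribution 4.0 International License.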

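Dataset source

This is a broadcast dataset collected from RÚV by Reykjavik University in 2019-2020. All episodes aired in 2019 at the latest and were recorded as digital originals. The text originates from RÚV subtitles (.vtt) and teletext (888). Audio files are 16 kHz single-channel FLAC created from the original .mp4 episodes. Alignment was done with the Kaldi Speech Recognition Toolkit and the scripts from the authors' alignment-and-segmentation repository. The dataset was released in February 2022.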

Shows included

  • Fréttir kl. 19:00 - prime time news
  • Kastljós - news commentary
  • Kiljan - literature discussion
  • Krakkafréttir - news for children
  • Menningin - arts and culture show
  • Stundin Okkar - children's variety show

Dataset structure

<corpus root>
|
. - docs/
|   . - README.txt
|
. - data/
    . - metadata.tsv
    . - text
    . - audio/
        . - Frettirkl1900/
        |   . - 4942689/
        |       . - 4942689-00000.flac
        |       . - ...
        . - Kastljos/
        . - Kiljan/
        . - Krakkafrettir/
        . - Menningin/
        . - StundinOkkar/
        . - filename.filetype

  • metadata.tsv: a tab-separated file containing utterance_id, episode_id, show_id, and duration (seconds). The path of an audio file can be constructed from the show_id, episode_id, and utterance_id (data/audio/show_id/episode_id/utterance_id.flac). Episode numbers within each show are sequential.
  • text: a text file in the format required for Kaldi data directories, containing each utterance_id followed by the text spoken in that utterance. Unrecognized words are represented with UNK.
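The two file descriptions above can be turned into a small loader. What follows is only a sketch under stated assumptions: it assumes metadata.tsv has exactly the four documented columns in the documented order and no header row, and that each line of text is an utterance_id followed by whitespace and the transcript. The function names and the sample rows (duration and transcript) are made up for illustration; only the IDs and directory names come from the listing above.

```python
import csv
import io
from pathlib import PurePosixPath


def audio_path(show_id, episode_id, utterance_id):
    """Build data/audio/show_id/episode_id/utterance_id.flac."""
    return str(PurePosixPath("data", "audio", show_id, episode_id,
                             f"{utterance_id}.flac"))


def read_metadata(handle):
    """Yield one dict per metadata.tsv row, with a derived audio path.

    Assumption: four tab-separated columns, documented order, no header row.
    """
    fields = ["utterance_id", "episode_id", "show_id", "duration"]
    reader = csv.DictReader(handle, fieldnames=fields, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        row["duration"] = float(row["duration"])
        row["path"] = audio_path(row["show_id"], row["episode_id"],
                                 row["utterance_id"])
        yield row


def read_kaldi_text(handle):
    """Map utterance_id -> transcript from a Kaldi-style text file."""
    transcripts = {}
    for line in handle:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        transcripts[parts[0]] = parts[1] if len(parts) > 1 else ""
    return transcripts


# Hypothetical sample rows, not taken from the corpus.
sample_metadata = io.StringIO("4942689-00000\t4942689\tFrettirkl1900\t3.5\n")
sample_text = io.StringIO("4942689-00000 góðan daginn UNK\n")

rows = list(read_metadata(sample_metadata))
transcripts = read_kaldi_text(sample_text)
print(rows[0]["path"])
print(transcripts[rows[0]["utterance_id"]])
```

With real data, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`. If metadata.tsv turns out to have a header row, that row has to be skipped before the `float` conversion.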

Statistics

  • 6 TV shows
  • 281 hours
  • 221,766 utterances
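As a rough sanity check on these figures, the mean utterance length can be estimated from the totals. This is only an approximation, since the 281-hour figure is presumably rounded.

```python
total_hours = 281      # total duration from the statistics list (likely rounded)
utterances = 221_766   # utterance count from the statistics list

# Convert hours to seconds, then divide by the number of utterances.
mean_seconds = total_hours * 3600 / utterances
print(f"about {mean_seconds:.2f} s per utterance")
```

The result, roughly four and a half seconds per utterance, is in line with short, segmented ASR training clips.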

Authors

  • Reykjavík University
    • Judy Y Fong
    • Inga Run Helgadottir
    • Helga Svala Sigurðardóttir
    • Michal Borsky
    • Ragnheiður Þórhallsdóttir
    • Jon Gudnason
  • The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
    • Helga Lara Thorsteinsdottir

License

This dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
