
5roop/juzne_vesti

Source: Hugging Face. Updated 2023-12-12. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/5roop/juzne_vesti
Resource description:
---
language:
- sr
license: cc-by-sa-4.0
size_categories:
- 10K<n<100K
pretty_name: Južne Vesti
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: split
    dtype: string
  - name: transcript
    dtype: string
  - name: norm_transcript
    dtype: string
  - name: guest_name
    dtype: string
  - name: host
    dtype: string
  - name: guest_description
    dtype: string
  - name: speaker_breakdown
    dtype: string
  splits:
  - name: train
    num_bytes: 4687838374.879606
    num_examples: 8648
  - name: test
    num_bytes: 584596072.5389507
    num_examples: 1081
  - name: dev
    num_bytes: 583281117.6094437
    num_examples: 1082
  download_size: 5813877393
  dataset_size: 5855715565.028001
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
handle:
- http://hdl.handle.net/11356/1679
---

# ASR training dataset for Serbian JuzneVesti-SR v1.0

hdl: http://hdl.handle.net/11356/1679

The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its host show called '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html).

The processing of the audio and its alignment to the manual transcripts followed the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible.

Segments in this dataset range from 2 to 30 seconds. The train-dev-test split was performed with an 80:10:10 ratio.

As with the ParlaSpeech-HR dataset, two transcriptions are provided: one with transcripts in their raw form (with punctuation, capital letters, numerals) and another normalised with the same rule-based normaliser that was used in ParlaSpeech-HR dataset creation, in which the text is lowercased, punctuation is removed, and numerals are replaced with words.

Original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
Please cite as

```
@misc{11356/1679,
 title = {{ASR} training dataset for Serbian {JuzneVesti}-{SR} v1.0},
 author = {Rupnik, Peter and Ljube{\v s}i{\'c}, Nikola},
 url = {http://hdl.handle.net/11356/1679},
 note = {Slovenian language resource repository {CLARIN}.{SI}},
 copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
 issn = {2820-4042},
 year = {2022}
}
```
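
The card does not say how to load the data. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository id `5roop/juzne_vesti` (taken from the download link) resolves on the Hub. The helper name `load_split` is hypothetical and not part of the dataset.

```python
REPO_ID = "5roop/juzne_vesti"      # repository id taken from the download link
SPLITS = ("train", "test", "dev")  # split names listed in the card metadata


def load_split(split: str):
    """Load one split of the dataset (hypothetical helper, not part of the card)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_split("dev")` would download the data on first use. Each returned example should then expose the feature columns named in the metadata, such as `transcript` and `norm_transcript`. Decoding the `audio` column may need additional audio dependencies, depending on the installed `datasets` version.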
Provided by:
5roop
Original information summary

Dataset overview

Basic information

  • Language: Serbian
  • License: CC-BY-SA-4.0
  • Size category: 10K<n<100K
  • Name: Južne Vesti

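Dataset details

  • Features:

    • audio: audio data
    • split: split the example belongs to (string)
    • transcript: raw transcript (string)
    • norm_transcript: normalised transcript (string)
    • guest_name: guest name (string)
    • host: host (string)
    • guest_description: guest description (string)
    • speaker_breakdown: speaker breakdown (string)
  • Splits:

    • train: training set, 8648 examples, 4687838374.879606 bytes
    • test: test set, 1081 examples, 584596072.5389507 bytes
    • dev: development set, 1082 examples, 583281117.6094437 bytes
  • Download size: 5813877393 bytes

  • Dataset size: 5855715565.028001 bytes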
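
The split figures above can be checked against the stated 80:10:10 ratio and the reported dataset size. This is a small sketch using only the numbers from the metadata; the variable names are illustrative.

```python
# Split statistics copied from the card metadata.
SPLIT_STATS = {
    "train": {"num_examples": 8648, "num_bytes": 4687838374.879606},
    "test": {"num_examples": 1081, "num_bytes": 584596072.5389507},
    "dev": {"num_examples": 1082, "num_bytes": 583281117.6094437},
}
DATASET_SIZE = 5855715565.028001  # dataset_size from the card metadata

total_examples = sum(s["num_examples"] for s in SPLIT_STATS.values())
total_bytes = sum(s["num_bytes"] for s in SPLIT_STATS.values())

# Share of examples per split, to compare with the stated 80:10:10 ratio.
shares = {
    name: s["num_examples"] / total_examples for name, s in SPLIT_STATS.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
print(f"total examples: {total_examples}")
# Allow a small tolerance because the byte counts are floating-point values.
print(f"bytes match dataset_size: {abs(total_bytes - DATASET_SIZE) < 1.0}")
```

The three splits add up to 10811 examples, and each share lies within a tenth of a percentage point of its nominal 80, 10, or 10 percent.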

Configuration

  • Default configuration:
    • train: data path data/train-*
    • test: data path data/test-*
    • dev: data path data/dev-*
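
To illustrate how the path patterns of the default configuration assign files to splits, here is a sketch using Python's standard `fnmatch` module. It stands in for the Hub's own pattern resolution, which may differ in detail, and the shard file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Split-to-pattern mapping from the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "dev": "data/dev-*",
}


def split_for(path: str):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical shard names, for illustration only.
print(split_for("data/train-00000-of-00010.parquet"))
print(split_for("data/dev-00000-of-00002.parquet"))
print(split_for("README.md"))
```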

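Dataset source

  • The dataset contains audio recordings and manual transcripts from the Južne Vesti website and its show '15 minuta'.
  • The processing of the audio and its alignment to the transcripts follows the ParlaSpeech-HR dataset pipeline as closely as possible.
  • Audio segments in the dataset range from 2 to 30 seconds.
  • The train-dev-test split ratio is 80:10:10.
  • Two transcripts are provided: a raw form (with punctuation, capital letters, numerals) and a normalised form (lowercased, punctuation removed, numerals replaced with words).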
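
The three normalisation steps named above (lowercasing, punctuation removal, numerals to words) can be sketched as follows. This is not the ParlaSpeech-HR rule-based normaliser, whose rules are not given here. The function name, the tiny numeral lookup, and the sample sentence are assumptions made for illustration, and the sketch ignores grammatical agreement and any numeral missing from the lookup.

```python
import re

# Tiny illustrative lookup; a real normaliser needs full number verbalisation.
NUMERAL_WORDS = {
    "2": "dva",
    "10": "deset",
    "15": "petnaest",
    "30": "trideset",
}


def normalise(text: str) -> str:
    """Lowercase, drop punctuation, and spell out the numerals in NUMERAL_WORDS."""
    text = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Numerals missing from the lookup are left unchanged.
    tokens = [NUMERAL_WORDS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


# Made-up sample sentence, not taken from the dataset.
print(normalise("Emisija '15 minuta', gost: Petar!"))
```

Python's `\w` is Unicode-aware for `str` patterns, so letters such as `ž` and `č` survive the punctuation step.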

