5roop/juzne_vesti

Name: 5roop/juzne_vesti
Creator: 5roop
Published: 2023-12-12 08:00:11
License: CC-BY-SA-4.0

Hugging Face. Updated 2023-12-12, indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/5roop/juzne_vesti

Description:

---
language:
- sr
license: cc-by-sa-4.0
size_categories:
- 10K<n<100K
pretty_name: Južne Vesti
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: split
    dtype: string
  - name: transcript
    dtype: string
  - name: norm_transcript
    dtype: string
  - name: guest_name
    dtype: string
  - name: host
    dtype: string
  - name: guest_description
    dtype: string
  - name: speaker_breakdown
    dtype: string
  splits:
  - name: train
    num_bytes: 4687838374.879606
    num_examples: 8648
  - name: test
    num_bytes: 584596072.5389507
    num_examples: 1081
  - name: dev
    num_bytes: 583281117.6094437
    num_examples: 1082
  download_size: 5813877393
  dataset_size: 5855715565.028001
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
handle:
- http://hdl.handle.net/11356/1679
---

# ASR training dataset for Serbian JuzneVesti-SR v1.0

hdl: http://hdl.handle.net/11356/1679

The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its interview show '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html).

The processing of the audio and its alignment to the manual transcripts followed the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible. Segments in this dataset range from 2 to 30 seconds. The train-dev-test split was performed with an 80:10:10 ratio.

As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters and numerals), and one normalised with the same rule-based normaliser that was used to create ParlaSpeech-HR. The normalised transcript is lowercased, has punctuation removed, and has numerals replaced with words.

Original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
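The stated 2 to 30 second segment range can be checked per segment once the audio is decoded. The following is a minimal sketch, assuming each decoded segment exposes a sample count and a sampling rate; the function names are hypothetical and are not part of the dataset or of any library.

```python
MIN_SECONDS = 2.0   # lower bound stated in the dataset description
MAX_SECONDS = 30.0  # upper bound stated in the dataset description


def segment_duration(num_samples: int, sampling_rate: int) -> float:
    """Return the duration in seconds of a single-channel audio segment."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def in_documented_range(num_samples: int, sampling_rate: int,
                        tolerance: float = 0.0) -> bool:
    """Check a segment against the documented 2 to 30 second range.

    `tolerance` (in seconds) is a hypothetical allowance for rounding
    at the segment boundaries; it is not mentioned in the dataset card.
    """
    duration = segment_duration(num_samples, sampling_rate)
    return MIN_SECONDS - tolerance <= duration <= MAX_SECONDS + tolerance
```

Segments that fall outside the range would be worth inspecting rather than silently dropping, since the description does not say whether the bounds are inclusive.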
Please cite as

```
@misc{11356/1679,
  title = {{ASR} training dataset for Serbian {JuzneVesti}-{SR} v1.0},
  author = {Rupnik, Peter and Ljube{\v s}i{\'c}, Nikola},
  url = {http://hdl.handle.net/11356/1679},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  issn = {2820-4042},
  year = {2022}
}
```

Provided by:

5roop

Summary of original information

Dataset overview

Basic information

Language: Serbian
License: CC-BY-SA-4.0
Size category: 10K<n<100K
Name: Južne Vesti

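Dataset details

Features:
- audio: audio data
- split: split the example belongs to (string)
- transcript: raw transcript (string)
- norm_transcript: normalised transcript (string)
- guest_name: guest name (string)
- host: host (string)
- guest_description: guest description (string)
- speaker_breakdown: speaker breakdown (string)
Splits:
- train: training set, 8648 examples, 4687838374.879606 bytes
- test: test set, 1081 examples, 584596072.5389507 bytes
- dev: development set, 1082 examples, 583281117.6094437 bytes
Download size: 5813877393 bytes
Dataset size: 5855715565.028001 bytes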
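As a quick consistency check, the example counts listed above can be compared with the stated 80:10:10 ratio. This is only arithmetic on the numbers in the listing, not an inspection of the data itself.

```python
# Example counts copied from the split listing.
split_sizes = {"train": 8648, "test": 1081, "dev": 1082}

total = sum(split_sizes.values())
fractions = {name: count / total for name, count in split_sizes.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_sizes[name]} of {total} examples ({fraction:.2%})")
```

The fractions come out within a tenth of a percentage point of 80%, 10% and 10%, so the counts agree with the stated ratio.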

Configuration

Default configuration:
- train: data path data/train-*
- test: data path data/test-*
- dev: data path data/dev-*
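With a single default configuration and three named splits, loading would typically go through the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is reachable under the ID shown in the title; the wrapper function `load_juzne_vesti` is a hypothetical helper, not something the dataset provides.

```python
VALID_SPLITS = ("train", "test", "dev")  # split names from the configuration


def load_juzne_vesti(split: str, streaming: bool = True):
    """Load one split of 5roop/juzne_vesti from the Hugging Face Hub.

    Streaming is on by default here (an assumption of this sketch) so that
    the roughly 5.8 GB download is not fetched up front.
    """
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {VALID_SPLITS}")
    # Imported lazily so the split check works without the library installed.
    from datasets import load_dataset

    return load_dataset("5roop/juzne_vesti", split=split, streaming=streaming)
```

Note that the held-out development split is named `dev`, not `validation`, so `load_juzne_vesti("dev")` is the call to use. Each returned example should carry the feature names listed above, such as `transcript` and `norm_transcript`.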

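Dataset source

The dataset contains audio recordings and manual transcripts from the Južne Vesti website and its show "15 minuta".
The processing of the audio and its alignment to the transcripts follows the ParlaSpeech-HR dataset pipeline as closely as possible.
Audio segments in the dataset range from 2 to 30 seconds.
The train-dev-test split ratio is 80:10:10.
Two transcripts are provided: a raw form (with punctuation, capital letters and numerals) and a normalised form (lowercased, punctuation removed, numerals replaced with words).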
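To make the difference between the two transcript forms concrete, here is a rough sketch of the three normalisation steps named above. It is not the rule-based normaliser used for ParlaSpeech-HR, whose rules are not given here. In particular, numeral replacement is reduced to a caller-supplied lookup table (`numeral_words`, a hypothetical parameter), because spelling out Serbian numerals correctly needs language-specific rules.

```python
import unicodedata
from typing import Dict, Optional


def normalise_transcript(text: str,
                         numeral_words: Optional[Dict[str, str]] = None) -> str:
    """Lowercase, strip punctuation, and replace known numerals with words."""
    numeral_words = numeral_words or {}
    tokens = []
    for token in text.lower().split():
        # Drop every character in a Unicode punctuation category (P*).
        token = "".join(
            ch for ch in token if not unicodedata.category(ch).startswith("P")
        )
        if not token:
            continue  # the token consisted only of punctuation
        # Replace the token only if the caller supplied a spelling for it.
        tokens.append(numeral_words.get(token, token))
    return " ".join(tokens)


# Illustrative input; the mapping is a hypothetical one-entry table.
print(normalise_transcript("Imam 15 minuta.", {"15": "petnaest"}))
```

When evaluating an ASR system against `norm_transcript`, the dataset's own normalised field is the one to use; a sketch like this would only apply to text that lacks it.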

Citation

@misc{11356/1679,
  title = {{ASR} training dataset for Serbian {JuzneVesti}-{SR} v1.0},
  author = {Rupnik, Peter and Ljube{\v s}i{\'c}, Nikola},
  url = {http://hdl.handle.net/11356/1679},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
  issn = {2820-4042},
  year = {2022}
}
