5roop/juzne_vesti
Source: Hugging Face. Updated 2023-12-12; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/5roop/juzne_vesti
Resource description:
---
language:
- sr
license: cc-by-sa-4.0
size_categories:
- 10K<n<100K
pretty_name: Južne Vesti
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: split
    dtype: string
  - name: transcript
    dtype: string
  - name: norm_transcript
    dtype: string
  - name: guest_name
    dtype: string
  - name: host
    dtype: string
  - name: guest_description
    dtype: string
  - name: speaker_breakdown
    dtype: string
  splits:
  - name: train
    num_bytes: 4687838374.879606
    num_examples: 8648
  - name: test
    num_bytes: 584596072.5389507
    num_examples: 1081
  - name: dev
    num_bytes: 583281117.6094437
    num_examples: 1082
  download_size: 5813877393
  dataset_size: 5855715565.028001
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
handle:
- http://hdl.handle.net/11356/1679
---
# ASR training dataset for Serbian JuzneVesti-SR v1.0
hdl: http://hdl.handle.net/11356/1679
The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its interview show '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html).
The processing of the audio and its alignment to the manual transcripts followed the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible.
Segments in this dataset range from 2 to 30 seconds.
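The 2 to 30 second range can be checked per segment. The following is a minimal sketch, not part of the original dataset tooling. It assumes the decoded `audio` field is a mapping with `array` and `sampling_rate` keys, as older releases of the `datasets` library return; newer releases may return a different object. The helper names and the synthetic 16 kHz record are hypothetical.

```python
def segment_seconds(audio):
    """Duration in seconds of a decoded audio record (samples / sampling rate)."""
    return len(audio["array"]) / audio["sampling_rate"]


def in_documented_range(audio, low=2.0, high=30.0):
    """True if the segment duration lies within the range stated in this card."""
    return low <= segment_seconds(audio) <= high


def stream_split(split="dev"):
    """Stream one split from the Hub without downloading the whole dataset.

    The import is local so the helpers above work without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset("5roop/juzne_vesti", split=split, streaming=True)


# Synthetic record (assumed 16 kHz, 48000 samples), used only for illustration.
fake_audio = {"array": [0.0] * 48000, "sampling_rate": 16000}
print(segment_seconds(fake_audio))      # 3.0
print(in_documented_range(fake_audio))  # True
```

With network access, something like `all(in_documented_range(ex["audio"]) for ex in stream_split("dev"))` would run the same check over real segments.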
The data was split into train, dev, and test sets in an 80:10:10 ratio.
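As a quick sanity check, the split sizes listed in the card metadata (8648 train, 1081 test, 1082 dev examples) can be compared with the stated ratio. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Example counts copied from the `splits` metadata of this card.
split_sizes = {"train": 8648, "test": 1081, "dev": 1082}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, share in shares.items():
    # Prints each split's share of all examples as a percentage.
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 80%, 10%, and 10% of the 10811 examples, which matches the stated ratio.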
As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters, and numerals) and one normalised with the same rule-based normaliser used to create ParlaSpeech-HR, which lowercases the text, removes punctuation, and replaces numerals with words.
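To illustrate the three normalisation steps, here is a minimal sketch. It is not the ParlaSpeech-HR normaliser, whose rules are not reproduced in this card. The function name, the tiny numeral lookup table, and the example sentence are all assumptions made for illustration; a real normaliser would need to spell out arbitrary numbers.

```python
import re

# Hypothetical lookup for a few numerals; anything not listed is left unchanged.
NUMERAL_WORDS = {"2": "dva", "15": "petnaest", "30": "trideset"}


def normalise(text, numeral_words=NUMERAL_WORDS):
    """Lowercase, drop punctuation, and replace known numerals with words."""
    lowered = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    tokens = no_punct.split()
    return " ".join(numeral_words.get(tok, tok) for tok in tokens)


# Made-up sentence, not taken from the dataset.
print(normalise("Dobro veče, ovo je 15 minuta."))
# dobro veče ovo je petnaest minuta
```

In the dataset itself, the raw text is in the `transcript` field and the normalised text in `norm_transcript`, so no such function is needed to use the data.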
Original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
Please cite as:
```
@misc{11356/1679,
title = {{ASR} training dataset for Serbian {JuzneVesti}-{SR} v1.0},
author = {Rupnik, Peter and Ljube{\v s}i{\'c}, Nikola},
url = {http://hdl.handle.net/11356/1679},
note = {Slovenian language resource repository {CLARIN}.{SI}},
copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
issn = {2820-4042},
year = {2022} }
```
Provider:
5roop
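Summary of original information
Dataset overview
Basic information
- Language: Serbian
- License: CC-BY-SA-4.0
- Size: 10K<n<100K
- Name: Južne Vesti
Dataset details
- Features:
  - audio: audio data
  - split: split type (string)
  - transcript: raw transcript (string)
  - norm_transcript: normalised transcript (string)
  - guest_name: guest name (string)
  - host: host (string)
  - guest_description: guest description (string)
  - speaker_breakdown: speaker breakdown (string)
- Splits:
  - train: training set, 8648 examples, 4687838374.879606 bytes
  - test: test set, 1081 examples, 584596072.5389507 bytes
  - dev: development set, 1082 examples, 583281117.6094437 bytes
- Download size: 5813877393 bytes
- Dataset size: 5855715565.028001 bytes
Configuration
- Default configuration:
  - train: data path data/train-*
  - test: data path data/test-*
  - dev: data path data/dev-*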
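Dataset source
- The dataset contains audio recordings and manual transcripts from the Južne Vesti website and its show "15 minuta".
- The alignment of audio and transcripts follows the ParlaSpeech-HR pipeline as closely as possible.
- Audio segments in the dataset range from 2 to 30 seconds.
- The train-dev-test split ratio is 80:10:10.
- Two transcripts are provided: a raw form (with punctuation, capital letters, and numerals) and a normalised form (lowercased, punctuation removed, numerals replaced with words).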