
ASR training dataset for Serbian JuzneVesti-SR

Source: SSH Open Marketplace. Updated 2025-07-04; indexed 2025-07-05.
Download link:
https://marketplace.sshopencloud.eu/dataset/3Jx801
Description:
This corpus consists of audio recordings and manual transcripts from the Južne Vesti website and its interview show [15 minuta](https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html). The audio processing and its alignment to the manual transcripts followed the pipeline of the [ParlaSpeech-HR dataset](http://hdl.handle.net/11356/1494) as closely as possible. Segments range from 2 to 30 seconds, and the data is split into train, dev, and test sets at an 80:10:10 ratio. As with ParlaSpeech-HR, two transcriptions are provided: one in raw form (with punctuation, capital letters, and numerals) and one normalised with the same rule-based normaliser used to create ParlaSpeech-HR, which lowercases the text, removes punctuation, and replaces numerals with words. The speaker_info attribute is sparser than in parliamentary corpora because less data is available in this domain; it covers only the guest name, guest description, host name, and speaker breakdown (whether the host or the guest is speaking). The corpus is available for download from the CLARIN.SI repository.
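The normalisation and split described above can be illustrated with a rough Python sketch. This is not the ParlaSpeech-HR normaliser or the authors' splitting procedure: the function names `normalise` and `split_80_10_10` are hypothetical, numeral-to-word expansion is deliberately left out because it needs a Serbian-specific verbaliser, and splitting at the level of individual segments with a fixed random seed is an assumption (the description does not say whether the split was made by segment, by recording, or by speaker).

```python
import random
import unicodedata


def normalise(text: str) -> str:
    """Lowercase and drop punctuation (illustrative only).

    Numerals are left untouched; the real normaliser replaces them
    with words, which this sketch does not attempt.
    """
    kept = []
    for ch in text.lower():
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


def split_80_10_10(items, seed=0):
    """Shuffle deterministically, then cut into train/dev/test lists.

    Assumption: items are independent segments; the seed is arbitrary.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    print(normalise("Dobar dan, Južne Vesti!"))
    train, dev, test = split_80_10_10(range(100))
    print(len(train), len(dev), len(test))
```

For real use, a split grouped by recording or speaker is usually preferable, so that the same voice does not appear in both training and test data.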

Created:
2025-07-04