UmarRamzan/prus
Source: Hugging Face · Updated: 2024-05-15 · Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/UmarRamzan/prus
Resource description:
---
language:
- ur
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 108262823.43502825
    num_examples: 566
  - name: test
    num_bytes: 27161338.564971752
    num_examples: 142
  download_size: 135363971
  dataset_size: 135424162.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
The Urdu Phonetically Rich Speech Corpus contains 70 minutes of transcribed read speech: 708 greedily constructed sentences that represent all phonemic and triphonemic combinations in Urdu (based on an 18 million word corpus of Urdu news articles). The sentences comprise 10,101 tokens and 5,656 unique words. In addition to providing phonetic cover for Urdu, the corpus is phonemically balanced. It also provides triphonemic cover; however, it is not completely balanced for triphonemes. It contains 60 unique phones and 42,289 phone occurrences. All sentences were written manually by trained linguists, who followed a greedy approach to accommodate the target words (selected using a set cover algorithm) while adding as few extra words as possible. As a result, although the sentences are grammatically correct, the choice of words is sometimes unusual.
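The description mentions that words were selected with a set cover algorithm. The sketch below illustrates the general greedy set cover idea on a toy lexicon; it is not the CSaLT selection procedure. The word labels, phone symbols, and function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of consecutive phone triples in one word."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_cover(lexicon: Dict[str, List[str]]) -> List[str]:
    """Greedily pick words until every triphone in the lexicon is covered."""
    units = {word: triphones(phones) for word, phones in lexicon.items()}
    uncovered: Set[Triphone] = set().union(*units.values())
    chosen: List[str] = []
    while uncovered:
        # Pick the word covering the most still-uncovered triphones.
        # Iterating over sorted() keeps tie-breaking deterministic.
        best = max(sorted(units), key=lambda w: len(units[w] & uncovered))
        chosen.append(best)
        uncovered -= units[best]
    return chosen


# Hypothetical toy lexicon: labels and phone symbols are made up.
TOY_LEXICON = {
    "w1": ["k", "a", "t", "a", "b"],  # kat, ata, tab
    "w2": ["k", "a", "t"],            # kat
    "w3": ["t", "a", "b"],            # tab
    "w4": ["b", "a", "t", "a"],       # bat, ata
}

selected = greedy_cover(TOY_LEXICON)
```

In this toy case the greedy pass first takes `w1` (three new triphones) and then `w4` (the only word that contributes `bat`), so two of the four words suffice. Greedy selection gives a small cover but not necessarily the smallest possible one.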
Further information and download instructions can be found at https://www.c-salt.org/downloads/prus
Copyright (c) 2017 by Center for Speech and Language Technologies (CSaLT), Information Technology University of the Punjab, Lahore, Pakistan. Your use of the CSaLT Phonetically Rich Urdu Speech Corpus is subject to our Creative Commons License (https://creativecommons.org/licenses/by/4.0/), which lets you distribute, remix, tweak, and build upon our work, even commercially, as long as you credit us for the original creation. You are required to cite the "Center for Speech and Language Technologies (CSaLT)" and the following two publications:
1. Agha Ali Raza, Sarmad Hussain, Huda Sarfraz, Inam Ullah, Zahid Sarfraz, An ASR System for Spontaneous Urdu Speech, Oriental COCOSDA 2010 conference, Nov. 24-25, 2010, Katmandu, Nepal.
2. Agha Ali Raza, Sarmad Hussain, Huda Sarfraz, Inam Ullah, Zahid Sarfraz, Design and development of phonetically rich Urdu speech corpus, Proceedings of O-COCOSDA'09 and IEEE Xplore; O-COCOSDA'09, 10-13 Aug 2009, School of Information Science and Engineering of Xinjiang University, Urumqi, China (URL: http://o-cocosda2009.xju.edu.cn).
Provided by:
UmarRamzan
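Summary of original information
Dataset overview
Basic information
- Language: Urdu
- Task category: automatic speech recognition (automatic-speech-recognition)
Dataset features
- Audio:
  - Name: audio
  - Data type: audio
- Transcription:
  - Name: transcription
  - Data type: string
Dataset splits
- Training set:
  - Examples: 566
  - Size: 108262823.43502825 bytes
- Test set:
  - Examples: 142
  - Size: 27161338.564971752 bytes
Dataset size
- Download size: 135363971 bytes
- Total dataset size: 135424162.0 bytes
Configuration
- Config name: default
- Data file paths:
  - Training set: data/train-*
  - Test set: data/test-*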
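The split sizes listed above can be checked against a downloaded copy. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository `UmarRamzan/prus` is reachable. The helper names (`split_fractions`, `load_prus`, `check_remote`) are hypothetical, and the expected row counts are copied from the metadata rather than verified here.

```python
from typing import Dict

REPO_ID = "UmarRamzan/prus"
# Row counts as stated in the dataset metadata.
EXPECTED_ROWS = {"train": 566, "test": 142}


def split_fractions(rows: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


def load_prus():
    """Download the dataset; needs network access and the datasets library."""
    # Imported lazily so the pure helpers above work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID)


def check_remote() -> None:
    """Compare the downloaded split sizes with the documented ones."""
    ds = load_prus()
    actual = {name: split.num_rows for name, split in ds.items()}
    if actual != EXPECTED_ROWS:
        raise ValueError(f"unexpected split sizes: {actual}")


fractions = split_fractions(EXPECTED_ROWS)
```

The two splits add up to the 708 sentences given in the corpus description, in roughly an 80/20 ratio. Calling `check_remote()` performs the actual download and comparison; it is deliberately not invoked automatically.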



