
UmarRamzan/prus

Source: Hugging Face. Updated 2024-05-15. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/UmarRamzan/prus
Resource description:
---
language:
- ur
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 108262823.43502825
    num_examples: 566
  - name: test
    num_bytes: 27161338.564971752
    num_examples: 142
  download_size: 135363971
  dataset_size: 135424162.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

The Urdu Phonetically Rich Speech Corpus consists of 70 minutes of transcribed read speech: 708 greedily constructed sentences representing all phonemic and triphonemic combinations in Urdu (based on an 18-million-word corpus of Urdu news articles). It contains 10,101 tokens and 5,656 unique words. In addition to providing phonetic cover for Urdu, the corpus is phonemically balanced. It also provides triphonemic cover; however, it is not completely balanced for triphonemes. It contains 60 unique phones and 42,289 phone occurrences.

The sentences in this corpus were all created manually by trained linguists, following a greedy approach to accommodate the words (which were selected using a set cover algorithm) while avoiding additional words as much as possible. As a result, the sentences are grammatically correct, but in some instances the choice of words is unusual.

Further information and download instructions can be found at https://www.c-salt.org/downloads/prus

Copyright (c) 2017 by Center for Speech and Language Technologies (CSaLT), Information Technology University of the Punjab, Lahore, Pakistan. Your use of the CSaLT Phonetically Rich Urdu Speech Corpus is subject to our Creative Commons License (https://creativecommons.org/licenses/by/4.0/), which lets you distribute, remix, tweak, and build upon our work, even commercially, as long as you credit us for the original creation.

You are required to cite the "Center for Speech and Language Technologies (CSaLT)" and the following two publications:

1. Agha Ali Raza, Sarmad Hussain, Huda Sarfraz, Inam Ullah, Zahid Sarfraz, An ASR System for Spontaneous Urdu Speech, Oriental COCOSDA 2010 conference, Nov. 24-25, 2010, Kathmandu, Nepal.
2. Agha Ali Raza, Sarmad Hussain, Huda Sarfraz, Inam Ullah, Zahid Sarfraz, Design and development of phonetically rich Urdu speech corpus, Proceedings of O-COCOSDA'09 and IEEE Xplore; O-COCOSDA'09, 10-13 Aug 2009, School of Information Science and Engineering of Xinjiang University, Urumqi, China (URL: http://o-cocosda2009.xju.edu.cn).
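
The description says the corpus words were selected using a set cover algorithm but does not say which variant. As an illustration only, the sketch below shows the standard greedy approximation on toy data. The function name `greedy_set_cover`, the candidate names `w1` to `w4`, and the single-letter "units" are all hypothetical stand-ins for words and their phonemes or triphonemes; this is not the authors' implementation.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover approximation (illustrative, not the corpus authors' code).

    universe:   set of units that must be covered (e.g. phonemes).
    candidates: dict mapping a candidate name to the set of units it covers.
    Returns (chosen names in selection order, units left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered and candidates:
        # Pick the candidate that covers the most still-uncovered units.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # no candidate covers anything new; stop with a partial cover
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy data: letters stand in for phonetic units, w1..w4 for candidate words.
toy_universe = {"a", "b", "c", "d", "e"}
toy_candidates = {
    "w1": {"a", "b", "c"},
    "w2": {"c", "d"},
    "w3": {"d", "e"},
    "w4": {"a"},
}
chosen, uncovered = greedy_set_cover(toy_universe, toy_candidates)
print(chosen, uncovered)
```

On the toy data, the first pick is `w1` (three new units) and the second is `w3` (two new units), after which nothing remains uncovered. Greedy selection favors a small word list, which is consistent with the description's note that some sentences use unusual words.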
Provided by:
UmarRamzan
Summary of original information

Dataset overview

Basic information

  • Language: Urdu
  • Task category: automatic speech recognition (automatic-speech-recognition)

Dataset features

  • Audio:
    • Name: audio
    • Data type: audio
  • Transcription:
    • Name: transcription
    • Data type: string

Dataset splits

  • Training set:
    • Number of examples: 566
    • Data size: 108262823.43502825 bytes
  • Test set:
    • Number of examples: 142
    • Data size: 27161338.564971752 bytes

Dataset size

  • Download size: 135363971 bytes
  • Total dataset size: 135424162.0 bytes
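
As a quick consistency check, the reported figures can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the description and the split metadata. The variable names are illustrative, and the per-sentence averages are simple derived estimates, not values reported by the corpus authors.

```python
# Figures copied from the dataset card.
train_rows, test_rows = 566, 142
train_bytes, test_bytes = 108262823.43502825, 27161338.564971752
reported_total_bytes = 135424162.0
corpus_minutes = 70
corpus_tokens = 10_101
phone_occurrences = 42_289

total_rows = train_rows + test_rows        # matches the 708 sentences in the description
total_bytes = train_bytes + test_bytes     # should agree with dataset_size
train_fraction = train_rows / total_rows   # roughly an 80/20 train/test split

# Derived averages (estimates only).
avg_seconds_per_sentence = corpus_minutes * 60 / total_rows
avg_tokens_per_sentence = corpus_tokens / total_rows
avg_phones_per_sentence = phone_occurrences / total_rows

print(f"rows: {total_rows}, train fraction: {train_fraction:.3f}")
print(f"bytes match: {abs(total_bytes - reported_total_bytes) < 1e-3}")
print(f"avg seconds/sentence: {avg_seconds_per_sentence:.2f}")
print(f"avg tokens/sentence: {avg_tokens_per_sentence:.2f}")
print(f"avg phones/sentence: {avg_phones_per_sentence:.2f}")
```

The two splits add up to the 708 sentences and the total dataset size given in the card, and each sentence averages a little under six seconds of audio.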

Configuration

  • Configuration name: default
  • Data file paths:
    • Training set: data/train-*
    • Test set: data/test-*
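
With the default configuration and the two splits listed above, the dataset could be loaded with the Hugging Face `datasets` library roughly as follows. This is a minimal sketch: it assumes `datasets` is installed, that the repository ID `UmarRamzan/prus` (taken from the download link) is reachable, and the helper names `load_prus` and `check_split_sizes` are hypothetical.

```python
EXPECTED_ROWS = {"train": 566, "test": 142}  # split sizes from the dataset card


def load_prus(repo_id="UmarRamzan/prus"):
    """Load all splits of the default configuration.

    Requires `pip install datasets` and network access; the import is done
    lazily so this module can be imported without the dependency.
    """
    from datasets import load_dataset
    return load_dataset(repo_id)


def check_split_sizes(num_rows, expected=EXPECTED_ROWS):
    """Return the names of splits whose row count differs from the card."""
    return [split for split, n in expected.items() if num_rows.get(split) != n]


if __name__ == "__main__":
    ds = load_prus()
    rows = {split: data.num_rows for split, data in ds.items()}
    print(rows, "mismatches:", check_split_sizes(rows))
    # Each example has the two features listed in the card.
    example = ds["train"][0]
    print(example["transcription"])
```

Each example exposes the `audio` and `transcription` features named in the card. Decoding the `audio` column may require additional audio dependencies depending on the installed `datasets` version.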