
ARL Urdu Speech Database, Training Data

Source: Mendeley Data. Updated 2024-01-31. Indexed 2024-06-28.
Download link:
https://catalog.ldc.upenn.edu/LDC2007S03
Resource description:
Introduction

ARL Urdu Speech Database, Training Data is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India. The recordings in this release were collected by Appen Pty Ltd, Sydney, Australia in 2006. The U.S. Army Research Laboratory (ARL) provided this corpus to the Linguistic Data Consortium for distribution.

Urdu is an Indo-Aryan language spoken throughout South Asia that developed under the Mughal Empire and Delhi Sultanate between 1200 AD and 1800 AD. It has Persian, Turkish and Arabic influences, but it is in fact a dialect of Hindustani. The word "Urdu" refers to the standardized register of Hindustani, but there are many non-standard idiolects as well. Urdu is the twentieth most spoken language in the world. It is the native language of over 60 million people, the official language of Pakistan, and one of India's national languages. Urdu is also spoken in Afghanistan.

The distribution of speaker dialects in the corpus is as follows:

Accent                Number of Speakers
South Sindh           29
North Sindh           30
South Punjab          27
North Punjab          29
Capital Area          29
North West Regions    30
Baluchistan           26

The database is divided into two parts: a training set containing approximately 80% of the data and a test set containing approximately 20% of the data. This release consists of approximately 80% of the complete dataset (training and test).

Data

Each speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones set at different distances from the speaker were used for the recordings. The recorded speech was stored in raw format files, with headers stored in separate directories. Each utterance was transcribed in the corresponding label file for each recording. The transcriptions were encoded in UTF-8. Punctuation was omitted and numbers were written out in full.

Update

Earlier versions were missing the content list file. This is now available as part of the complete download file.

09/14/18 - The test data for this corpus was originally held back and is now available as part of the download. New downloads after the indicated date will contain the full corpus.

Samples

For an example of the data in this corpus, please listen to this audio sample (.wav format).

Portions © 2006 U.S. Army Research Laboratory, © 2007 Trustees of the University of Pennsylvania
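As a quick sanity check on the figures in the description, the accent distribution can be tallied to confirm that it accounts for all 200 speakers. The sketch below is illustrative only: the dictionary and function names are hypothetical and not part of the corpus. The "nominal prompt count" assumes every speaker read all 400 prompts, which the description does not guarantee.

```python
# Speaker counts per accent group, copied from the description above.
ACCENT_COUNTS = {
    "South Sindh": 29,
    "North Sindh": 30,
    "South Punjab": 27,
    "North Punjab": 29,
    "Capital Area": 29,
    "North West Regions": 30,
    "Baluchistan": 26,
}

# Each speaker was presented with 400 prompts, per the description.
PROMPTS_PER_SPEAKER = 400


def summarize(counts):
    """Return the total number of speakers and each accent's share of it."""
    total = sum(counts.values())
    shares = {accent: n / total for accent, n in counts.items()}
    return total, shares


total_speakers, shares = summarize(ACCENT_COUNTS)

# Upper bound only: assumes every speaker completed every prompt.
nominal_prompts = total_speakers * PROMPTS_PER_SPEAKER

for accent, share in shares.items():
    print(f"{accent:<20} {ACCENT_COUNTS[accent]:>3}  {share:6.1%}")
print(f"Total speakers: {total_speakers}")
print(f"Nominal prompt count: {nominal_prompts}")
```

The per-accent counts sum to 200, matching the speaker total stated in the introduction, and the groups are close to evenly sized.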
Created:
2024-01-31