
KenSpeech: Swahili Speech Transcriptions

Source: DataONE. Updated: 2024-06-30. Indexed: 2025-04-26.
Download link:
https://search.dataone.org/view/sha256:96bcd3a5d4ebaaac3504ecc8c49a3d5039015d6d71e003dcda743aa655611f8a
Description:
This speech dataset contains both read and spontaneous speech, recorded in Kenya with native Swahili speakers. It comprises 27 hours 31 minutes 50 seconds of speech from 26 speakers (19 female, 7 male). Read speech accounts for 26 hours 32 minutes 37 seconds, and spontaneous speech for the remaining 59 minutes 13 seconds. The recordings are .wav files: 16-bit, 16 kHz, mono, little-endian.

Each audio file has a corresponding transcript. For example, the audio file tweet_5701.wav in the audios folder corresponds to the transcript file tweet_5701.txt in the transcripts folder.

The dataset also includes a phone list file, kencorpus.phone, containing all the Swahili phones used by KenCorpus. This phone list was used to create the KenCorpus Swahili lexicon-phone dictionary, kencorpus.dic, which lists every word in the KenCorpus transcripts with its pronunciation expressed in the phones from the phone list. The dictionary contains over 30,000 words.

Acknowledgement of data curators: Dorcas Awino, Dr. Benard Okal, Khalid Kitito, Owiny Japheth Otieno
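
The audio format and the audio-to-transcript naming described above can be checked with a short script. The following is a minimal sketch, not part of the dataset: it assumes the archive has been extracted to a local directory containing the audios and transcripts folders, and the function names (transcript_path_for, matches_stated_format, duration_seconds, pair_files) are hypothetical helpers introduced here for illustration. It uses only Python's standard wave and pathlib modules.

```python
import wave
from pathlib import Path


def transcript_path_for(audio_path: Path, transcripts_dir: Path) -> Path:
    """Map e.g. audios/tweet_5701.wav to transcripts/tweet_5701.txt."""
    return transcripts_dir / (audio_path.stem + ".txt")


def matches_stated_format(audio_path: Path) -> bool:
    """Return True if the file is mono, 16-bit, 16 kHz, as the description states."""
    with wave.open(str(audio_path), "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16 bits
            and wav.getframerate() == 16000  # 16 kHz
        )


def duration_seconds(audio_path: Path) -> float:
    """Duration of one recording, computed from frame count and sample rate."""
    with wave.open(str(audio_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def pair_files(root: Path):
    """Yield (audio, transcript) pairs; audio files without a transcript are skipped."""
    audios_dir = root / "audios"            # folder name taken from the description
    transcripts_dir = root / "transcripts"  # folder name taken from the description
    for audio_path in sorted(audios_dir.glob("*.wav")):
        transcript = transcript_path_for(audio_path, transcripts_dir)
        if transcript.exists():
            yield audio_path, transcript
```

Summing duration_seconds over every paired file gives a figure that can be compared with the stated total of 27 hours 31 minutes 50 seconds. The stated read and spontaneous portions are consistent with that total: 26:32:37 + 00:59:13 = 27:31:50.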

Created:
2024-09-25