SoraniTTS dataset: Central Kurdish (CK) Speech Corpus for Text-to-Speech

Source: NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/jmtn248cc9

Description:

This dataset is a comprehensive resource for advancing Kurdish text-to-speech (TTS) systems. Text-to-speech conversion is an important topic in the design and construction of multimedia systems, human-machine communication, and information and communication technology. Together with speech recognition, its purpose is to establish communication between humans and machines in the most basic and natural form, spoken language.

For the text corpus, we collected 6,565 training sentences from texts in various categories: news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism. We thoroughly reviewed and normalized the texts, which were then recorded by a male speaker. The audio was recorded in a voice recording studio at 44,100 Hz, and all audio files were downsampled to 22,050 Hz for our modeling process. Individual recordings range from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text and audio pairs, totaling around 19 hours of speech.

All audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders. In addition, for model training, all of the audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file’s name.wav| transcript, where the audio file’s name includes the extension and the transcript is the text of the speech.

The audio recording and editing process lasted 90 days and involved capturing over 6,565 WAV files and over 19 hours of recorded speech. The dataset is intended to help researchers begin improving Kurdish TTS sooner, reducing the time this process takes.

Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.
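
The transcript line format described above (WAVS | audio file’s name.wav| transcript) can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not a loader shipped with the dataset: it assumes the three fields are separated by the `|` character with optional surrounding spaces, and that the first field is a fixed prefix such as `WAVS`. The names `Utterance` and `parse_transcript_line`, the file name `sample_0001.wav`, and the sample text are hypothetical.

```python
from typing import NamedTuple, Optional


class Utterance(NamedTuple):
    """One transcript entry (field names are illustrative, not from the dataset)."""
    prefix: str      # first field, e.g. "WAVS"
    audio_file: str  # audio file name including the .wav extension
    text: str        # transcript of the recording


def parse_transcript_line(line: str) -> Optional[Utterance]:
    """Parse one 'WAVS | name.wav| transcript' line; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    # Split on the first two separators only, so a '|' inside the
    # transcript text (if any) is kept as part of the text.
    parts = line.split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}: {line!r}")
    prefix, audio_file, text = (part.strip() for part in parts)
    return Utterance(prefix, audio_file, text)


# Hypothetical example line; the real file names and text will differ.
example = parse_transcript_line("WAVS | sample_0001.wav| placeholder transcript")
```

Stripping whitespace around each field tolerates the uneven spacing around the separators shown in the format string. If the actual transcript files use a different delimiter or encoding, the split step would need to be adjusted accordingly.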

Created:

2025-09-16
