Karakalpak Speech Corpus
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/2th8jvft8f
Description:
The Karakalpak Speech Corpus is the first large-scale, publicly available speech-to-text dataset for the Karakalpak language, designed to support the development, evaluation, and benchmarking of automatic speech recognition (ASR) systems for this low-resource Turkic language.
Research hypothesis
The core hypothesis behind this dataset is that high-quality, carefully curated speech–text pairs, even at moderate scale, can enable state-of-the-art self-supervised models (such as Wav2Vec 2.0) to achieve strong recognition performance for low-resource languages.
By providing sufficient phonetic, lexical, and speaker diversity, the corpus aims to bridge the data gap that has historically limited Karakalpak speech technology.
What the data contains
The dataset consists of:
Speech recordings in WAV format (16 kHz, 16-bit PCM)
Manually verified transcriptions in standard Karakalpak Latin orthography
Speaker-independent splits for training, validation, and testing
Each audio file corresponds to a single utterance, making the corpus suitable for end-to-end ASR, forced alignment, pronunciation modeling, and acoustic analysis.
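The stated audio format (16 kHz, 16-bit PCM WAV) can be checked per file before training. The following is a minimal sketch using only Python's standard `wave` module; the function names `check_wav_format` and `make_silent_wav` are illustrative and not part of the dataset, and the channel count is not checked because the description does not state it.

```python
import io
import wave


def check_wav_format(source, expected_rate=16000, expected_sample_width=2):
    """Return a list of mismatches between a WAV file and the stated format.

    `source` may be a file path or a binary file-like object.
    A sample width of 2 bytes corresponds to 16-bit PCM.
    The `wave` module raises `wave.Error` for files it cannot parse as PCM WAV.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate {wav.getframerate()} Hz")
        if wav.getsampwidth() != expected_sample_width:
            problems.append(f"sample width {wav.getsampwidth() * 8} bits")
    return problems


def make_silent_wav(rate, sample_width, seconds=0.1):
    """Build a short in-memory mono WAV of silence (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * sample_width))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # A synthetic file stands in for a corpus recording here.
    print(check_wav_format(make_silent_wav(16000, 2)))
    print(check_wav_format(make_silent_wav(8000, 2)))
```

An empty list means the file matches the stated format; any entry names the property that differs.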
The recordings include:
Read speech
Conversational and narrative sentences
Phonetically rich word sequences
Numbers, commands, and daily expressions
This mix of material gives broad coverage of Karakalpak phonology, morphology, and vocabulary.
How the data was gathered
The corpus was collected from native Karakalpak speakers under controlled recording conditions.
All recordings were made in quiet indoor environments with consumer-grade microphones and laptops, at a sampling rate of 16 kHz. Speakers were instructed to read predefined texts clearly and naturally.
All transcriptions were manually checked and normalized to remove spelling inconsistencies, Unicode artifacts, and non-Karakalpak characters.
This results in a clean and reproducible linguistic representation of spoken Karakalpak.
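A transcription check of the kind described above could look like the sketch below. It is not the authors' pipeline. It assumes that Unicode artifacts are handled by NFC composition and whitespace collapsing, and that the allowed inventory is the basic Latin alphabet plus the special letters named in this description. The names `normalize_transcript` and `unexpected_letters` are hypothetical. Case folding is deliberately left out, since generic lowercasing may not match the orthography's dotted and dotless i pairing.

```python
import unicodedata

BASIC_LATIN = "abcdefghijklmnopqrstuvwxyz"
# Assumption: the special letters are exactly those named in this description.
SPECIAL_LETTERS = unicodedata.normalize("NFC", "áóúıńśǵ")
ALLOWED_LETTERS = set(
    BASIC_LATIN
    + BASIC_LATIN.upper()
    + SPECIAL_LETTERS
    + SPECIAL_LETTERS.upper()
)


def normalize_transcript(text):
    """Compose combining marks (NFC) and collapse runs of whitespace."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


def unexpected_letters(text):
    """Return the sorted letters that fall outside the assumed inventory.

    Digits, punctuation, and spaces are ignored; only alphabetic
    characters are checked.
    """
    normalized = normalize_transcript(text)
    return sorted(
        {ch for ch in normalized if ch.isalpha() and ch not in ALLOWED_LETTERS}
    )
```

Reporting unexpected letters, rather than silently deleting them, keeps a human in the loop for the manual verification step the description mentions.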
What the data shows
The dataset demonstrates that:
Karakalpak phonemes and special letters (á, ó, ú, ı, ń, ś, ǵ) can be reliably captured and modeled
A consistent orthography and vocabulary can be established for ASR training
Speaker-independent evaluation is feasible
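Speaker independence means that no speaker appears in more than one split. A minimal sketch of that check follows, assuming a hypothetical manifest of `(utterance_id, speaker_id, split)` tuples; the description does not specify the corpus's actual metadata layout.

```python
from collections import defaultdict


def speakers_shared_across_splits(manifest):
    """Map each speaker found in more than one split to those splits.

    `manifest` is an iterable of (utterance_id, speaker_id, split) tuples.
    An empty result means the splits are speaker-independent.
    """
    splits_by_speaker = defaultdict(set)
    for _utterance_id, speaker_id, split in manifest:
        splits_by_speaker[speaker_id].add(split)
    return {
        speaker: sorted(splits)
        for speaker, splits in splits_by_speaker.items()
        if len(splits) > 1
    }
```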
Wav2Vec 2.0 models fine-tuned on the corpus reach low word error rates (WER) and character error rates (CER), indicating that the dataset contains sufficient acoustic and linguistic information for high-quality speech recognition.
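WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below shows the standard definition in plain Python. The function names are illustrative, and the description does not say which tool the authors used for scoring.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.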
Created:
2026-01-29



