A Comprehensive Kurdish Speech Corpus for Speaker Identification and Verification

Mendeley Data, indexed 2026-05-21

Download link:

https://data.mendeley.com/datasets/7rv22xjmdx

Resource description:

Abstract / General Description: This dataset comprises a proprietary acoustic corpus specifically developed for text-independent Speaker Identification and Verification (SIV) within a low-resource language environment (Central Kurdish). The dataset contains 86,505 discrete audio utterances recorded from 200 demographically diverse native speakers. It is designed to address the critical data deficiency in underrepresented computational linguistics and provides a robust empirical foundation for training deep learning biometric architectures. The data is structurally optimized for researchers extracting high-dimensional acoustic representations, specifically 2D log Mel-spectrograms, to execute spatial feature-learning via Convolutional Neural Networks (CNNs).

Data Collection and Preprocessing Metrics:
Audio Format: Raw audio stored in .ogg format.
Utterance Duration: Uniformly normalized to 1.0 second per clip to effectively capture invariant phonetic variations while ensuring computational parameter efficiency.
Volume: 86,505 total independent acoustic samples.
Density: A minimum of 400 discrete audio samples per individual participant.

Dataset Partitioning (Train/Validation/Test Split): The dataset partitioning technique carefully separates the testing environment from the training pipeline to prevent data leakage. It is structured into three distinct subsets:
Training and Validation Sets (73,538 files): This aggregate subset is divided into an 85% training set and a 15% validation set. The split utilizes a stratified sampling method to meticulously preserve the proportionate representation of each speaker class throughout both subsets.
Isolated Test Set (12,967 files): A separate directory of completely unseen audio samples assembled exclusively for final model assessment and cross-dataset evaluation protocols.

Demographic Distribution:
Total Participants: 200 native speakers.
Gender Split: 101 Male, 99 Female.
Age Cohorts:
Under 18: 6 participants
18–25: 47 participants
26–40: 82 participants
41–60: 62 participants
Over 60: 3 participants

Recommended Usage & Technical Implementation: This corpus is engineered for advanced audio-to-image classification tasks. It is empirically proven to support the extraction of Mel-spectrograms (configured to 64 Mel-frequency bins and 44 temporal frames) for training 2D-CNN topologies. The dataset structure facilitates rigorous cross-dataset evaluation protocols for both multi-class closed-set speaker identification and open-set, threshold-dependent biometric security verification.
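The 85% / 15% stratified split described under Dataset Partitioning can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `stratified_split`, the speaker labels, and the toy corpus of three speakers with 400 clips each are all hypothetical. The point is only that splitting speaker by speaker keeps each speaker's share the same in both subsets.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=0):
    """Split sample indices into train/validation, one speaker at a time."""
    # Group sample indices by speaker label.
    by_speaker = defaultdict(list)
    for index, speaker in enumerate(labels):
        by_speaker[speaker].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for speaker in sorted(by_speaker):
        indices = by_speaker[speaker]
        rng.shuffle(indices)
        # Each speaker contributes the same fraction to the validation set.
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


# Toy stand-in for the corpus: 3 hypothetical speakers, 400 clips each
# (400 is the per-speaker minimum quoted in the description).
labels = [f"speaker_{s:03d}" for s in range(3) for _ in range(400)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # prints: 1020 180
```

Applied to the 73,538 training and validation files, the same proportions give roughly 62,507 training files and 11,031 validation files. The exact counts depend on how rounding is handled per speaker.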
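The 64 x 44 spectrogram shape quoted above can be sanity-checked with a little arithmetic. The description does not state the sample rate or hop length, so the values below (22,050 Hz, hop of 512 samples, centered framing) are assumptions chosen because they are common defaults and happen to yield 44 frames for a 1.0 second clip. The helper `num_frames` is hypothetical.

```python
def num_frames(num_samples, hop_length, center=True, n_fft=2048):
    """Number of short-time analysis frames for a signal of num_samples."""
    if center:
        # Centered framing pads the signal, so one frame starts at every hop.
        return 1 + num_samples // hop_length
    # Without padding, only frames that fit entirely in the signal count.
    return 1 + (num_samples - n_fft) // hop_length


sample_rate = 22050  # assumption: not stated in the dataset description
hop_length = 512     # assumption: not stated in the dataset description
clip_seconds = 1.0   # from the description: clips are normalized to 1.0 s
n_mels = 64          # from the description: 64 Mel-frequency bins

frames = num_frames(int(sample_rate * clip_seconds), hop_length)
print((n_mels, frames))  # prints: (64, 44)
```

Other sample rate and hop combinations also produce 44 frames, so this confirms the shape is plausible and nothing more. It does not recover the authors' settings.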

Created:

2026-05-18
