
A Dataset for the Classification of Different Kurdish Dialects

Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/srkp2j4v93
Resource description:
Kurdish is an Indo-Iranian language spoken mainly by people of Kurdish descent in Turkey, Iraq, Iran, and Syria. It has several regional dialects. The most common is Northern Kurdish (also known as Kurmanji or Badini), while Central Kurdish (also known as Sorani) is spoken in some regions of Iraq and Iran. Hawrami, often referred to as Gorani, is a distinct Kurdish dialect and the primary language of the Hawraman area, which spans portions of western Iran and northeastern Iraq. Although the dialects differ in pronunciation, vocabulary, and some points of grammar, they share a common core.

Despite this dialectal variation, the Kurdish language remains an essential component of Kurdish identity and cultural legacy, and it plays an essential role in protecting and promoting the distinct cultural identity of the Kurdish people. Language recognition and dialect recognition are closely connected tasks within linguistics and natural language processing. A good dataset for Kurdish dialect recognition supports dialect identification and classification, natural language processing applications for Kurdish, preservation of Kurdish linguistic heritage, cultural insights, customized content and services for users, and the empowerment of local businesses, and it provides a benchmark for evaluating dialect recognition systems.

The dataset was gathered over several months by members of the teaching staff of the Computer Science Department at the University of Halabja. The established policies, procedures, and guidelines were followed at each stage of data collection, including taking into account the ages and genders of the speakers included in the dataset. The recordings are taken from a variety of TV programmes and TV interviews broadcast on Speda TV, NRT, and GK Sat.

The dataset contains 2000 samples of the Sorani dialect, 2000 of the Badini dialect, and 2000 of the Hawrami dialect. Each sample is exactly one second long, so the total duration of the dataset is 6000 s.

The suggested labeling procedure has two consecutive stages. First, the recordings of each dialect (Sorani, Badini, and Hawrami) are sorted into separate directories. Second, the files in each directory are numbered from 1 to 2000 with a dialect prefix: s1 to s2000 for Sorani files, b1 to b2000 for Badini files, and h1 to h2000 for Hawrami files.
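The labeling scheme and the duration arithmetic can be illustrated with a minimal Python sketch. Only the prefixes (s, b, h), the range 1 to 2000, and the one-second sample length come from the description; the root folder name `dataset`, the per-dialect folder names, the `.wav` extension, and the helper names `expected_paths` and `dialect_from_label` are assumptions made for illustration and may not match the published archive.

```python
from pathlib import PurePosixPath

# Prefixes, sample count, and sample length as stated in the description.
PREFIXES = {"Sorani": "s", "Badini": "b", "Hawrami": "h"}
SAMPLES_PER_DIALECT = 2000
SAMPLE_SECONDS = 1


def expected_paths(root="dataset", ext=".wav"):
    """Yield (dialect, path) for every file the labeling scheme implies.

    ASSUMPTION: root, the folder names, and ext are hypothetical; the
    description does not state the actual directory names or file format.
    """
    for dialect, prefix in PREFIXES.items():
        for i in range(1, SAMPLES_PER_DIALECT + 1):
            yield dialect, PurePosixPath(root) / dialect / f"{prefix}{i}{ext}"


def dialect_from_label(label):
    """Map a label such as 's17' back to its dialect name."""
    by_prefix = {p: d for d, p in PREFIXES.items()}
    prefix, number = label[:1], label[1:]
    if prefix not in by_prefix or not number.isdigit():
        raise ValueError(f"not a valid label: {label!r}")
    if not 1 <= int(number) <= SAMPLES_PER_DIALECT:
        raise ValueError(f"index out of range: {label!r}")
    return by_prefix[prefix]


paths = list(expected_paths())
total_seconds = len(paths) * SAMPLE_SECONDS  # 3 dialects * 2000 samples * 1 s
print(len(paths), total_seconds)
```

A list like `paths` could be compared against the files actually present after download to spot missing or misnamed samples, provided the assumed folder names and extension are adjusted to the real layout.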
Created:
2024-10-15