A speech dataset of three ethnic languages of Bangladesh: Chakma, Marma and Garo.

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/yjhybztwf4

Description:

This dataset provides a curated collection of speech recordings in three ethnic languages of Bangladesh: Chakma, Marma, and Garo. It contains 2321 WAV audio recordings, each 1 to 7 seconds long. All recordings were collected from 11 native speakers aged 20–26 years, who read from a fixed set of 211 predefined Bengali sentences. The recordings were made on various smartphones in different, naturally occurring acoustic environments, so background conditions vary. The dataset includes audio files only, with no accompanying metadata file.

Speaker Distribution

- Chakma: 5 speakers
- Marma: 3 speakers
- Garo: 3 speakers

All three ethnic groups are represented, with Chakma contributing more speakers than Marma or Garo.

Key Features of the Dataset

- Total Audio Files: 2321
- Ethnic Languages: Chakma, Marma, Garo
- Total Speakers: 11
- Speaker Age Range: 20–26 years
- Sentences Read: 211 predefined Bengali sentences
- Audio Duration: 1–7 seconds per file
- Audio Format: WAV
- Recording Devices: Various smartphones
- Metadata: No metadata file included; only audio files are provided

Data Organization

All recordings are supplied as individual WAV files. The files may be grouped by ethnic language or by speaker; the folder structure is not documented in this description.

Potential Applications

- Automatic Speech Recognition (ASR) for low-resource languages
- Ethnic language identification (LID)
- Speaker identification and verification
- Acoustic and phonetic analysis
- Training multilingual or cross-lingual speech models
- Research on low-resource speech processing

Significance

Ethnic languages in Bangladesh are significantly underrepresented in speech technology research. This dataset offers a resource for developing speech-based systems and for supporting language preservation and computational linguistic studies.
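
Because no metadata file is included, the stated properties (2321 WAV files, each 1 to 7 seconds long) can only be confirmed from the audio itself. The following is a minimal sketch of such a check using Python's standard `wave` module. The function names and the idea of passing a local download directory are assumptions for illustration, not part of the dataset. The sketch also assumes the files are uncompressed PCM WAV, which is what the `wave` module reads; any file it cannot parse is reported rather than skipped silently.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_wavs(root, min_seconds=1.0, max_seconds=7.0):
    """Recursively summarize the WAV files under `root`.

    `root` is a hypothetical local folder holding the downloaded files.
    The default bounds mirror the 1-7 second range stated in the description.
    """
    durations = {}
    unreadable = []
    for path in sorted(Path(root).rglob("*")):
        # Match the extension case-insensitively (.wav, .WAV, ...).
        if not path.is_file() or path.suffix.lower() != ".wav":
            continue
        try:
            durations[path] = wav_duration_seconds(path)
        except (wave.Error, EOFError):
            # Not a PCM WAV file the stdlib reader understands.
            unreadable.append(path)

    out_of_range = [
        path for path, seconds in durations.items()
        if not min_seconds <= seconds <= max_seconds
    ]
    values = list(durations.values())
    return {
        "count": len(values),
        "min_duration": min(values) if values else None,
        "max_duration": max(values) if values else None,
        "out_of_range": out_of_range,
        "unreadable": unreadable,
    }
```

Calling `summarize_wavs` on the extracted download folder and comparing `count` with 2321 gives a quick integrity check; a non-empty `out_of_range` or `unreadable` list points to files worth inspecting by hand.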

Created:

2025-11-24