A speech dataset of three ethnic languages of Bangladesh: Chakma, Marma and Garo.
Source: Mendeley Data. Indexed: 2026-04-18
Download link:
https://data.mendeley.com/datasets/yjhybztwf4
Description:
This dataset provides a curated collection of speech recordings from three ethnic languages of Bangladesh: Chakma, Marma, and Garo. It contains a total of 2321 WAV audio recordings, each ranging from 1 to 7 seconds in duration. All recordings were collected from 11 native speakers aged 20–26 years, who read from a consistent set of 211 predefined Bengali sentences.
The audio samples were recorded using various smartphone devices across different, naturally occurring acoustic environments, providing diverse background conditions. The dataset includes audio files only, without any accompanying metadata file.
Speaker Distribution
The dataset includes:
Chakma: 5 speakers
Marma: 3 speakers
Garo: 3 speakers
Each language is represented by at least three speakers, though the distribution is not perfectly balanced: Chakma has five speakers, while Marma and Garo have three each.
Key Features of the Dataset
Total Audio Files: 2321
Ethnic Languages: Chakma, Marma, Garo
Total Speakers: 11
Speaker Age Range: 20–26 years
Sentences Read: 211 predefined Bengali sentences
Audio Duration: 1–7 seconds per file
Audio Format: WAV
Recording Devices: Various smartphones
Metadata: No metadata file included; only audio files are provided
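The figures listed above (file count and the 1–7 second duration range) can be checked locally after download. The following is a minimal sketch, not part of the dataset: the function names and the root directory are hypothetical, and it assumes the files are uncompressed PCM WAV, which is the only encoding Python's standard `wave` module reads. Files it cannot parse are reported separately rather than treated as errors.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def summarize(root, min_s=1.0, max_s=7.0):
    """Count WAV files under root and flag durations outside [min_s, max_s].

    Returns (total_files, out_of_range, unreadable), where out_of_range is a
    list of (path, duration) pairs and unreadable is a list of paths that the
    wave module could not parse.
    """
    # Match the extension case-insensitively (.wav, .WAV, ...).
    paths = sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".wav"
    )
    out_of_range, unreadable = [], []
    for p in paths:
        try:
            duration = wav_duration_seconds(p)
        except (wave.Error, EOFError):
            unreadable.append(p)
            continue
        if not (min_s <= duration <= max_s):
            out_of_range.append((p, duration))
    return len(paths), out_of_range, unreadable
```

Calling `summarize("path/to/extracted/dataset")` on the extracted archive (the path is a placeholder) should report 2321 files if the download matches the description above.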
Data Organization
All recordings are supplied as individual WAV files. The files may be grouped into folders by ethnic language or by speaker; the folder structure is not documented in this description.
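Since no metadata file ships with the audio, users may want to generate a simple manifest themselves. The sketch below is one way to do that under a stated assumption: that the first folder level under the dataset root names the language or speaker. That layout is not confirmed by the description, and the function name `build_manifest` and the column names are illustrative only.

```python
import csv
from pathlib import Path


def build_manifest(root, out_csv):
    """Write one CSV row per WAV file: its relative path and top-level folder."""
    root = Path(root)
    rows = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.suffix.lower() == ".wav":
            rel = p.relative_to(root)
            # Assumption: the first path component is a language or speaker
            # folder. Files sitting directly in the root get an empty group.
            group = rel.parts[0] if len(rel.parts) > 1 else ""
            rows.append({"path": rel.as_posix(), "group": group})
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "group"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The resulting CSV can then be extended by hand with speaker or sentence identifiers if those can be recovered from the file names.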
Potential Applications
This dataset can be used for:
Automatic Speech Recognition (ASR) for low-resource languages
Ethnic language identification (LID)
Speaker identification and verification
Acoustic and phonetic analysis
Training multilingual or cross-lingual speech models
Research on low-resource speech processing
Significance
Ethnic languages in Bangladesh are significantly underrepresented in speech technology research. This dataset offers an important resource for developing speech-based systems and contributing to language preservation, technology advancement, and computational linguistic studies.
Created:
2025-11-24



