BD-Dialect: A Multiregional Bangla Language Dataset
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/k769s4vk5z
Description:
The BD-Dialect dataset provides parallel linguistic data for Standard Bangla and five of its major regional dialects: Noakhali, Sylheti, Chittagong, Rajshahi, and Mymensingh. It includes aligned translations at both the word and clause levels, along with English translations for cross-linguistic reference.
The dataset is organized around two primary CSV files, each containing 950 entries, and is accompanied by a metadata file, a set of audio samples, and a preprocessing notebook:
BD-Dialect_Words.csv – Word-level aligned translations across all six language variants.
BD-Dialect_Clauses.csv – Clause/sentence-level aligned translations across all six language variants.
BD-Dialect_Metadata.csv – Detailed metadata describing each column/variable, including validation information.
BD-Dialect_Audio_Samples.zip – A small set of audio recordings (mp4 format) from native speakers for phonetic reference and verification.
BD-Dialect_Preprocessing_Scripts.ipynb – Python Jupyter notebook containing scripts for data cleaning, normalization, and basic analysis.
File Format:
All CSV files are UTF-8 encoded with header rows and can be imported into Python (Pandas), R, Excel, or similar tools. The Jupyter notebook requires a Python environment and was tested in Google Colab.
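As a minimal sketch of the import step described above, the following reads a UTF-8 CSV with a header row using only the Python standard library. The helper name load_rows and the column names in the in-memory sample are hypothetical, and the cell values are placeholders rather than real dialect data; the actual headers are documented in BD-Dialect_Metadata.csv.

```python
import csv
import io


def load_rows(source):
    """Read a UTF-8 CSV with a header row into a list of dicts.

    `source` may be a file path or an already-open text stream.
    """
    if isinstance(source, str):
        # utf-8-sig reads plain UTF-8 and also drops a leading BOM if present.
        with open(source, encoding="utf-8-sig", newline="") as handle:
            return list(csv.DictReader(handle))
    return list(csv.DictReader(source))


# Tiny in-memory stand-in for a real file.
# Column names and values are placeholders (assumptions).
sample = io.StringIO(
    "Standard_Bangla,Sylheti,English\n"
    "standard-1,sylheti-1,english-1\n"
    "standard-2,sylheti-2,english-2\n"
)
rows = load_rows(sample)
print(len(rows), list(rows[0].keys()))
```

With the real files, a call such as load_rows("BD-Dialect_Words.csv") should return 950 rows if the description above holds. Users who prefer pandas can pass the same path to pandas.read_csv with encoding="utf-8".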
Usage Notes:
Use the BD-Dialect_Words.csv and BD-Dialect_Clauses.csv files for linguistic analysis or model training.
Refer to BD-Dialect_Metadata.csv to understand the structure, source, and validation status of each linguistic column.
The audio samples are provided as a limited pilot set for phonetic verification and are not a comprehensive audio corpus.
The preprocessing scripts demonstrate the data cleaning pipeline and can be adapted for further analysis.
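For model training, the parallel (wide) rows usually need to be reshaped into one labeled example per variant. The sketch below shows one possible way to do that; it is not taken from the authors' notebook. The list VARIANT_COLUMNS and the function to_labeled_examples are hypothetical names, and the column headers are assumed and should be checked against BD-Dialect_Metadata.csv.

```python
# Assumed column headers for the six language variants (hypothetical).
VARIANT_COLUMNS = [
    "Standard_Bangla",
    "Noakhali",
    "Sylheti",
    "Chittagong",
    "Rajshahi",
    "Mymensingh",
]


def to_labeled_examples(rows, variant_columns=VARIANT_COLUMNS):
    """Turn wide parallel rows into (text, variant_label) pairs.

    Empty or missing cells are skipped so they do not become
    blank training examples.
    """
    examples = []
    for row in rows:
        for column in variant_columns:
            text = (row.get(column) or "").strip()
            if text:
                examples.append((text, column))
    return examples


# Placeholder rows standing in for parsed CSV records.
demo_rows = [
    {"Standard_Bangla": "standard-1", "Sylheti": "sylheti-1", "Noakhali": ""},
    {"Standard_Bangla": "standard-2", "Sylheti": "sylheti-2", "Noakhali": "noakhali-2"},
]
examples = to_labeled_examples(demo_rows)
print(examples)
```

Each pair can then feed a dialect classifier, with the column name serving as the class label.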
Applications:
This dataset is designed to support a wide range of research and development activities, including:
Dialect Identification & NLP: Training and evaluating models for dialect classification, speech recognition, and text normalization.
Machine Translation: Developing systems for translation between Standard Bangla and its dialects, or between dialects and English.
Linguistic Research: Enabling comparative studies in dialectology, phonology, and lexical variation.
Resource for Low-Resource Languages: Providing a foundational, validated corpus for Bangla, an underrepresented language in NLP.
Educational Tools: Serving as a resource for language learning and sociolinguistic studies.
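As one illustration of a lexical variation study, the sketch below estimates how often a dialect form differs from the Standard Bangla form across aligned entries. The function divergence_rate and the column names are hypothetical, and exact string mismatch is a deliberately crude measure chosen for brevity. Strings are normalized to Unicode NFC first so that visually identical Bangla text encoded in different ways is not counted as a difference.

```python
import unicodedata


def divergence_rate(rows, standard_column, dialect_column):
    """Share of aligned entries whose dialect form differs from the standard form.

    Rows where either cell is empty or missing are ignored.
    Returns 0.0 when no rows can be compared.
    """
    compared = 0
    differing = 0
    for row in rows:
        standard = unicodedata.normalize("NFC", (row.get(standard_column) or "").strip())
        dialect = unicodedata.normalize("NFC", (row.get(dialect_column) or "").strip())
        if not standard or not dialect:
            continue
        compared += 1
        if standard != dialect:
            differing += 1
    return differing / compared if compared else 0.0


# Placeholder rows (not real dialect data); column names are assumptions.
demo_rows = [
    {"Standard_Bangla": "form-a", "Sylheti": "form-a"},
    {"Standard_Bangla": "form-b", "Sylheti": "form-x"},
    {"Standard_Bangla": "form-c", "Sylheti": "form-y"},
    {"Standard_Bangla": "form-d", "Sylheti": "form-d"},
    {"Standard_Bangla": "form-e", "Sylheti": ""},
]
rate = divergence_rate(demo_rows, "Standard_Bangla", "Sylheti")
print(rate)
```

A more careful analysis would likely replace exact mismatch with an edit-distance or phonologically informed measure.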
Citation:
If you use this dataset, please cite:
Rahman, Anika; Hasan Muna, Nafesha; Prity, Masuma Saba (2026), “BD-Dialect: A Multiregional Bangla Language Dataset”, Mendeley Data, V2, doi: 10.17632/k769s4vk5z.2
License:
CC BY 4.0 – allowing reuse with proper attribution.
Created:
2026-01-05



