
BD-Dialect: A Multiregional Bangla Language Dataset

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/k769s4vk5z
Description:
The BD-Dialect dataset provides parallel linguistic data for Standard Bangla and five of its major regional dialects: Noakhali, Sylheti, Chittagong, Rajshahi, and Mymensingh. It includes aligned translations at both the word and clause levels, along with English translations for cross-linguistic reference.

The dataset is organized around two primary CSV files, each containing 950 entries:
BD-Dialect_Words.csv: Word-level aligned translations across all six language variants.
BD-Dialect_Clauses.csv: Clause/sentence-level aligned translations across all six language variants.

Three supporting files accompany them:
BD-Dialect_Metadata.csv: Detailed metadata describing each column/variable, including validation information.
BD-Dialect_Audio_Samples.zip: A small set of audio recordings (mp4 format) from native speakers for phonetic reference and verification.
BD-Dialect_Preprocessing_Scripts.ipynb: Python Jupyter notebook containing scripts for data cleaning, normalization, and basic analysis.

File Format: All CSV files are UTF-8 encoded with header rows and can be imported into Python (Pandas), R, Excel, or similar tools. The Jupyter notebook requires a Python environment and was tested in Google Colab.

Usage Notes: Use BD-Dialect_Words.csv and BD-Dialect_Clauses.csv for linguistic analysis or model training. Refer to BD-Dialect_Metadata.csv to understand the structure, source, and validation status of each linguistic column. The audio samples are a limited pilot set for phonetic verification, not a comprehensive audio corpus. The preprocessing scripts demonstrate the data cleaning pipeline and can be adapted for further analysis.

Applications: This dataset is designed to support a range of research and development activities, including:
Dialect Identification & NLP: Training and evaluating models for dialect classification, speech recognition, and text normalization.
Machine Translation: Developing systems for translation between Standard Bangla and its dialects, or between dialects and English.
Linguistic Research: Enabling comparative studies in dialectology, phonology, and lexical variation.
Resource for Low-Resource Languages: Providing a foundational, validated corpus for Bangla, an underrepresented language in NLP.
Educational Tools: Serving as a resource for language learning and sociolinguistic studies.

Citation: If you use this dataset, please cite: Rahman, Anika; Hasan Muna, Nafesha; Prity, Masuma Saba (2026), “BD-Dialect: A Multiregional Bangla Language Dataset”, Mendeley Data, V2, doi: 10.17632/k769s4vk5z.2

License: CC BY 4.0, allowing reuse with proper attribution.
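
As an illustration of the usage notes above, the following is a minimal sketch of reading one of the header-row CSV files and extracting dialect-to-standard pairs for translation or normalization work. It uses only the Python standard library. The column names (Standard_Bangla, Sylheti, English) and the sample rows are hypothetical placeholders, not taken from the dataset; the actual headers should be checked against BD-Dialect_Metadata.csv.

```python
import csv
import io


def read_aligned_rows(handle):
    """Read a header-row CSV into a list of dicts, one per aligned entry."""
    return list(csv.DictReader(handle))


def build_pairs(rows, source_col, target_col):
    """Return (source, target) pairs, skipping rows where either cell is empty."""
    pairs = []
    for row in rows:
        src = (row.get(source_col) or "").strip()
        tgt = (row.get(target_col) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Placeholder text standing in for the real file.
# Column names and cell values here are hypothetical, for illustration only.
sample = io.StringIO(
    "Standard_Bangla,Sylheti,English\n"
    "standard_1,sylheti_1,english_1\n"
    "standard_2,,english_2\n"
)

rows = read_aligned_rows(sample)
pairs = build_pairs(rows, "Sylheti", "Standard_Bangla")
print(len(rows), pairs)
```

To run this against the real data, the in-memory sample could be replaced with open("BD-Dialect_Words.csv", encoding="utf-8", newline=""), substituting the column names found in the file's header row.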
Created:
2026-01-05