BanglaDial: A Merged and Imbalanced Text Dataset for Bengali Regional Dialect Analysis
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/sx6ybcps2n
Description:
This dataset is gathered from online repositories and contains sentences in 12 distinct regional dialects of Bangladesh. The primary goal of this dataset is to support research in dialect classification, language modeling, and sociolinguistic analysis of Bangladeshi dialects. The dataset exhibits an imbalanced distribution of dialects, which reflects the natural variation in speaker population and data availability across regions.
Before finalizing the corpus, several preprocessing steps were performed to ensure quality and consistency. The process began with dataset source identification and merging of different resources, followed by duplicate removal to avoid redundancy. Social media-specific elements such as mentions and hashtags were cleaned, along with the elimination of emojis that did not contribute to textual meaning. Next, punctuation and special characters were removed to maintain a cleaner text structure, and finally, whitespace normalization was applied to ensure uniform formatting. After these steps, the final dataset was generated in a ready-to-use format.
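The cleaning steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual pipeline: the function names `clean_sentence` and `preprocess` are hypothetical, the mention/hashtag pattern is an assumption, and punctuation and emojis are approximated by dropping every character in the Unicode punctuation (P*) and symbol (S*) categories. The sketch also deduplicates after cleaning, whereas the description lists duplicate removal before cleaning.

```python
import re
import unicodedata

# Assumed pattern: a mention or hashtag is "@" or "#" followed by non-space characters.
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")


def clean_sentence(text: str) -> str:
    """Apply the cleaning steps to one sentence (hypothetical helper)."""
    # Remove social-media mentions and hashtags.
    text = MENTION_OR_HASHTAG.sub(" ", text)
    # Replace punctuation (P*) and symbols (S*, which covers most emojis)
    # with spaces. Bengali letters and vowel signs fall in the L* and M*
    # categories, so they are left untouched.
    text = "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    )
    # Normalize whitespace: collapse runs and trim both ends.
    return " ".join(text.split())


def preprocess(rows):
    """Clean (sentence, label) pairs and drop empty or duplicate rows."""
    seen = set()
    result = []
    for sentence, label in rows:
        key = (clean_sentence(sentence), label)
        if key[0] and key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Category-based filtering is used here instead of a `\w`-style regular expression because Python's `\w` does not match Bengali combining vowel signs, so a naive "remove non-word characters" pattern would damage the text.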
The dataset is structured in two columns: (i) Sentence, representing a text string written in Bengali, and (ii) Class, indicating the name of the dialect region (e.g., Chittagong, Rajshahi). The dataset is provided in CSV and XLSX formats.
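A minimal sketch of reading the two-column layout with Python's standard `csv` module is shown below. The column names `Sentence` and `Class` come from the description above; the helper name `read_dataset`, the in-memory sample rows, and the file name mentioned in the comment are placeholders, since the actual file names in the download are not stated here.

```python
import csv
import io


def read_dataset(handle):
    """Return (sentence, dialect) pairs from an open CSV text handle."""
    reader = csv.DictReader(handle)
    return [(row["Sentence"], row["Class"]) for row in reader]


# Placeholder data standing in for the real file, for illustration only.
sample = io.StringIO(
    "Sentence,Class\n"
    "placeholder sentence one,Chittagong\n"
    "placeholder sentence two,Rajshahi\n"
)
rows = read_dataset(sample)

# With the downloaded file (hypothetical name), the call would look like:
# with open("BanglaDial.csv", encoding="utf-8-sig", newline="") as f:
#     rows = read_dataset(f)
```

Opening with `utf-8-sig` is an assumption that tolerates a byte-order mark if the CSV was exported from a spreadsheet tool; plain `utf-8` works when no such mark is present.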
Dialect-Wise Sentence Distribution
Chittagong: 8,661
Kishoreganj: 8,694
Narail: 7,746
Tangail: 5,410
Rangpur: 5,881
Narsingdi: 5,735
Standard Bangla: 4,403
Barisal: 4,046
Sylhet: 3,710
Mymensingh: 3,096
Noakhali: 2,462
Rajshahi: 885
Total: 60,729 sentences
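Because the distribution is imbalanced (the largest class is roughly 9.8 times the size of the smallest), a classifier trained on it may benefit from per-class weights. The sketch below uses the counts listed above and the common inverse-frequency heuristic, weight = total / (number of classes × class count). This weighting scheme is an assumption for illustration and is not prescribed by the dataset authors.

```python
# Sentence counts per dialect, copied from the distribution above.
counts = {
    "Chittagong": 8661,
    "Kishoreganj": 8694,
    "Narail": 7746,
    "Tangail": 5410,
    "Rangpur": 5881,
    "Narsingdi": 5735,
    "Standard Bangla": 4403,
    "Barisal": 4046,
    "Sylhet": 3710,
    "Mymensingh": 3096,
    "Noakhali": 2462,
    "Rajshahi": 885,
}

total = sum(counts.values())

# Ratio between the largest and smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency ("balanced") weights: rare dialects get larger weights.
weights = {
    dialect: total / (len(counts) * n) for dialect, n in counts.items()
}

for dialect, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{dialect}: {weight:.3f}")
```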
Created:
2025-09-29



