BanglaDial: A Merged and Imbalanced text Dataset for Bengali Regional dialect analysis.

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/sx6ybcps2n

Description:

This dataset is gathered from online repositories and contains sentences in 12 distinct regional dialects of Bangladesh. The primary goal of this dataset is to support research in dialect classification, language modeling, and sociolinguistic analysis of Bangladeshi dialects. The dataset exhibits an imbalanced distribution of dialects, which reflects the natural variation in speaker population and data availability across regions.

Before finalizing the corpus, several preprocessing steps were performed to ensure quality and consistency. The process began with dataset source identification and merging of different resources, followed by duplicate removal to avoid redundancy. Social media-specific elements such as mentions and hashtags were cleaned, along with the elimination of emojis that did not contribute to textual meaning. Next, punctuation and special characters were removed to maintain a cleaner text structure, and finally, whitespace normalization was applied to ensure uniform formatting. After these steps, the final dataset was generated in a ready-to-use format.

The dataset is structured in two columns: (i) Sentence, representing a text string written in Bengali, and (ii) Class, indicating the name of the dialect region (e.g., Chittagong, Rajshahi). The dataset is provided in CSV and XLSX formats.

Dialect-wise sentence distribution:

Chittagong: 8,661
Kishoreganj: 8,694
Narail: 7,746
Tangail: 5,410
Rangpur: 5,881
Narsingdi: 5,735
Standard Bangla: 4,403
Barisal: 4,046
Sylhet: 3,710
Mymensingh: 3,096
Noakhali: 2,462
Rajshahi: 885

Total: 60,729 sentences
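
The preprocessing steps described above (merging, duplicate removal, mention and hashtag cleaning, emoji and punctuation removal, whitespace normalization) could be sketched roughly as follows. This is an illustrative reconstruction, not the authors' actual code: the function names `clean_sentence` and `build_corpus`, the exact-match duplicate rule, and the choice to replace removed characters with a space are all assumptions. Only the Python standard library is used.

```python
import re
import unicodedata

# Matches @mentions and #hashtags up to the next whitespace. \S+ is used
# instead of \w+ because Python's \w does not match Bengali vowel signs.
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")


def clean_sentence(text: str) -> str:
    """Clean one sentence (hypothetical helper, not the authors' code)."""
    # 1. Drop social-media mentions and hashtags.
    text = MENTION_OR_HASHTAG.sub(" ", text)
    # 2. Drop emojis, punctuation and special characters: Unicode categories
    #    P* (punctuation) and S* (symbols, which covers most emojis), plus the
    #    emoji variation selector U+FE0F. Letters and combining marks are
    #    kept, so Bengali vowel signs survive.
    kept = []
    for ch in text:
        if ch == "\ufe0f" or unicodedata.category(ch)[0] in ("P", "S"):
            kept.append(" ")  # assumption: replace with a space, do not join words
        else:
            kept.append(ch)
    # 3. Normalize whitespace: collapse runs and trim both ends.
    return " ".join("".join(kept).split())


def build_corpus(sources):
    """Merge (sentence, dialect) rows from several sources, dedupe, then clean."""
    seen = set()
    corpus = []
    for rows in sources:
        for sentence, dialect in rows:
            key = (sentence, dialect)
            if key in seen:  # assumption: duplicates are exact matches
                continue
            seen.add(key)
            cleaned = clean_sentence(sentence)
            if cleaned:  # skip rows that are empty after cleaning
                corpus.append((cleaned, dialect))
    return corpus
```

Filtering by Unicode category rather than with a pattern such as `[^\w\s]` is a deliberate choice here: in Python's `re` module, `\w` does not cover combining marks, so a `\w`-based filter would strip Bengali vowel signs along with the punctuation.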

Created:

2025-09-29
