BanglaDial: A Merged and Imbalanced text Dataset for Bengali Regional dialect analysis.

Indexed from NIAID Data Ecosystem on 2026-05-10
Download link:
https://data.mendeley.com/datasets/sx6ybcps2n
Description:
This dataset is gathered from online repositories and contains sentences in 12 distinct regional dialects of Bangladesh. The primary goal of this dataset is to support research in dialect classification, language modeling, and sociolinguistic analysis of Bangladeshi dialects. The dataset exhibits an imbalanced distribution of dialects, which reflects the natural variation in speaker population and data availability across regions.

Before finalizing the corpus, several preprocessing steps were performed to ensure quality and consistency. The process began with dataset source identification and merging of different resources, followed by duplicate removal to avoid redundancy. Social media-specific elements such as mentions and hashtags were cleaned, along with the elimination of emojis that did not contribute to textual meaning. Next, punctuation and special characters were removed to maintain a cleaner text structure, and finally, whitespace normalization was applied to ensure uniform formatting. After these steps, the final dataset was generated in a ready-to-use format.

The dataset is structured in two columns: (i) Sentence, representing a text string written in Bengali, and (ii) Class, indicating the name of the dialect region (e.g., Chittagong, Rajshahi). The dataset is provided in CSV and XLSX formats.

Dialect-Wise Sentence Distribution
Chittagong: 8,661
Kishoreganj: 8,694
Narail: 7,746
Tangail: 5,410
Rangpur: 5,881
Narsingdi: 5,735
Standard Bangla: 4,403
Barisal: 4,046
Sylhet: 3,710
Mymensingh: 3,096
Noakhali: 2,462
Rajshahi: 885
Total: 60,729 sentences
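
The preprocessing steps described above (duplicate removal, mention and hashtag cleaning, emoji removal, punctuation removal, whitespace normalization) could be approximated as in the sketch below. This is not the authors' published code. The function names `clean_sentence` and `build_corpus`, the rule that a mention or hashtag ends at the next whitespace, and the use of Unicode general categories to identify punctuation and emoji are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: a mention or hashtag runs from '@' or '#' to the next whitespace.
MENTION_OR_HASHTAG = re.compile(r"[@#]\S+")


def clean_sentence(text: str) -> str:
    """Apply the per-sentence cleaning steps to one string (hypothetical helper)."""
    # 1. Strip social-media mentions and hashtags.
    text = MENTION_OR_HASHTAG.sub(" ", text)
    # 2. Replace punctuation (Unicode categories P*) and symbols (S*, which
    #    covers most emoji) with a space. Letters (L*), combining marks (M*)
    #    and digits (N*) are kept, so Bengali vowel signs survive.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # 3. Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def build_corpus(rows):
    """Drop exact duplicate (sentence, dialect) pairs, then clean each sentence."""
    seen = set()
    cleaned = []
    for sentence, dialect in rows:
        if (sentence, dialect) in seen:
            continue
        seen.add((sentence, dialect))
        result = clean_sentence(sentence)
        if result:  # skip rows that are empty after cleaning
            cleaned.append((result, dialect))
    return cleaned
```

Deduplication runs before cleaning here to mirror the order given in the description; running it again after cleaning would also catch sentences that differ only in punctuation. The sketch does not handle emoji joiners or variation selectors, and whether the original pipeline kept digits is not stated.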
Created:
2025-09-29