
Overview of our proposed dataset.

Source: Figshare | Updated: 2026-02-25 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/_p_Overview_of_our_proposed_dataset_p_/31415032
Description:
Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address the challenge of regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences and 101,817 words annotated with 10 entity tags across 5 regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models—Bangla BERT, Bangla Bert Base, and BERT Base Multilingual Cased—on this dataset. Bangla BERT achieved the highest performance overall, with F1-scores of 82.27% (Mymensingh), 81.48% (Barishal), 78.75% (Sylhet), 78.50% (Noakhali), and 75.31% (Chittagong). These results highlight strong recognition capability in Mymensingh and Barishal, while dialectal variation in Chittagong remains challenging. As no prior NER resources exist for Bangla regional dialects, this work provides a foundational dataset and baseline benchmarks to facilitate future research. Future work will focus on dialect-aware model adaptation and expanding coverage to additional regions.
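To make the reported baseline easier to compare across dialects, the short sketch below summarizes the Bangla BERT F1-scores quoted in the description. The scores are copied from the text above; the `summarize` helper and the unweighted mean it computes are illustrative assumptions, not figures or code published by the dataset authors.

```python
# F1-scores (%) reported for Bangla BERT in the description above.
reported_f1 = {
    "Mymensingh": 82.27,
    "Barishal": 81.48,
    "Sylhet": 78.75,
    "Noakhali": 78.50,
    "Chittagong": 75.31,
}


def summarize(scores):
    """Hypothetical helper: return (unweighted mean, best, worst, best-worst gap)."""
    mean = sum(scores.values()) / len(scores)
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    gap = scores[best] - scores[worst]
    return mean, best, worst, gap


mean_f1, best, worst, gap = summarize(reported_f1)
print(f"Unweighted mean F1: {mean_f1:.2f}%")
print(f"Best: {best}, worst: {worst}, gap: {gap:.2f} points")
```

The unweighted mean treats each dialect equally regardless of how many sentences it contributes, so it may differ from an average weighted by dialect size. The gap of roughly seven points between Mymensingh and Chittagong matches the description's remark that Chittagong remains the most challenging dialect.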
Created:
2026-02-25