ChattoBan: A Benchmark Dataset for Language Identification Between Bengali and Chittagonian Dialects

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/mfsg573r9t

Description:

Chittagonian is one of the most widely spoken native languages in Bangladesh, with an estimated 14 million speakers across the country and abroad. Although Bengali is the national language, Chittagonian differs significantly in phonology, vocabulary, and grammar. These linguistic differences make automatic language identification an important task for NLP applications such as machine translation, language detection, and sentiment analysis.

To address the scarcity of Chittagonian language resources, we introduce ChattoBan, a benchmark dataset designed for sentence-level identification between Bengali and Chittagonian. The dataset contains 6,151 annotated sentences, categorized as follows:

- Chittagonian: 2,650 sentences
- Bengali: 3,501 sentences

Chittagonian sentences were collected from social media platforms (Facebook, Twitter), Chittagonian news articles, song lyrics, and direct contributions from native speakers. Bengali sentences were sourced from various Bengali newspapers and classical literature to ensure authentic and diverse language representation.

To ensure annotation reliability, two native Chittagonian speakers and one native Bengali speaker independently reviewed and validated all sentence labels. Additionally, preprocessing steps such as duplicate removal, punctuation removal, and English character and number filtering were applied to enhance data quality while preserving linguistic authenticity.

The ChattoBan dataset has significant implications across multiple NLP and AI domains, including:

- Language identification for closely related languages
- Machine translation and code-switching analysis
- Supervised and semi-supervised learning
- Sociolinguistic and dialect studies
- Bangla-centric NLP research and educational applications

The ChattoBan dataset is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community. By providing a reliable benchmark for Bengali–Chittagonian identification, this dataset aims to support future advancements in low-resource language processing.
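
The preprocessing steps named in the description (duplicate removal, punctuation removal, and English character and number filtering) could be approximated as in the sketch below. It is not the authors' pipeline: the function names, the punctuation set, the label strings, and the choice to filter only ASCII letters and digits are all assumptions made for illustration, and the description does not say whether Bengali digits were also removed.

```python
import re
import string

# Assumed punctuation set: ASCII punctuation plus the Bengali danda,
# double danda, curly quotes, ellipsis, and dashes.
PUNCTUATION = string.punctuation + "\u0964\u0965\u2018\u2019\u201c\u201d\u2026\u2013\u2014"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")

# Assumed reading of "English character and number filtering":
# ASCII letters and ASCII digits only.
_LATIN_DIGIT_RE = re.compile(r"[A-Za-z0-9]")


def clean_sentence(text):
    """Strip punctuation, ASCII letters, and ASCII digits; collapse whitespace."""
    text = _PUNCT_RE.sub(" ", text)
    text = _LATIN_DIGIT_RE.sub(" ", text)
    return " ".join(text.split())


def preprocess(rows):
    """Clean (sentence, label) pairs, drop empty results, and remove duplicates.

    Duplicates are detected on the cleaned sentence, and the first
    occurrence is kept so the original order is preserved.
    """
    seen = set()
    cleaned_rows = []
    for sentence, label in rows:
        cleaned = clean_sentence(sentence)
        if not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        cleaned_rows.append((cleaned, label))
    return cleaned_rows


if __name__ == "__main__":
    # Illustrative rows only; these are not taken from the dataset.
    sample = [
        ("\u0986\u09ae\u09bf \u09ad\u09be\u09a4 \u0996\u09be\u0987\u0964", "bengali"),
        ("\u0986\u09ae\u09bf \u09ad\u09be\u09a4 \u0996\u09be\u0987", "bengali"),
        ("ok 12", "bengali"),
    ]
    for sentence, label in preprocess(sample):
        print(label, sentence)
```

Deduplicating after cleaning, rather than before, also catches sentences that differ only in punctuation or embedded Latin text. Whether the original pipeline did this is not stated.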

Created:

2025-11-19
