ChattoBan: A Benchmark Dataset for Language Identification Between Bengali and Chittagonian Dialects
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/mfsg573r9t
Description:
Chittagonian is one of the most widely spoken native languages in Bangladesh, with an estimated 14 million speakers in the country and abroad. Although Bengali is the national language, Chittagonian differs from it significantly in phonology, vocabulary, and grammar. These differences make automatic language identification an important task for downstream NLP applications such as machine translation and sentiment analysis.
To address the scarcity of Chittagonian language resources, we introduce ChattoBan, a benchmark dataset designed for sentence-level identification between Bengali and Chittagonian. The dataset contains 6,151 annotated sentences, categorized as follows:
Chittagonian: 2,650 sentences
Bengali: 3,501 sentences
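As a quick check on the figures above, the class shares and a majority-class baseline can be worked out directly from the two counts. This is only an arithmetic sketch; the variable names are illustrative and not part of the dataset.

```python
# Sentence counts as stated in the dataset description.
counts = {"Chittagonian": 2650, "Bengali": 3501}

total = sum(counts.values())  # should match the stated 6,151 sentences
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class sets the floor
# that any real model should beat.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

The two classes sum to the stated total, and always predicting Bengali would already score about 56.9% accuracy, so reported accuracy on this benchmark is best read against that floor.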
Chittagonian sentences were collected from social media platforms (Facebook, Twitter), Chittagonian news articles, song lyrics, and direct contributions from native speakers. Bengali sentences were sourced from various Bengali newspapers and classical literature to ensure authentic and diverse language representation.
To ensure annotation reliability, two native Chittagonian speakers and one native Bengali speaker independently reviewed and validated all sentence labels. In addition, preprocessing steps (duplicate removal, punctuation removal, and filtering of English characters and numbers) were applied to improve data quality while preserving linguistic authenticity.
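The preprocessing steps named above could look roughly like the following. This is a minimal sketch, not the authors' actual pipeline: the function names `clean_sentence` and `preprocess` are hypothetical, "English characters and numbers" is assumed to mean ASCII letters and ASCII digits, and punctuation is assumed to mean any Unicode punctuation character (which includes the Bengali danda "।").

```python
import re
import unicodedata

# Assumption: "English characters and numbers" = ASCII letters and digits.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")
WHITESPACE = re.compile(r"\s+")


def clean_sentence(text: str) -> str:
    """Strip ASCII letters/digits and Unicode punctuation from one sentence."""
    text = ASCII_ALNUM.sub(" ", text)
    # Unicode categories starting with "P" cover punctuation, including "।".
    # Bengali vowel signs are combining marks (category "M") and are kept.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return WHITESPACE.sub(" ", text).strip()


def preprocess(sentences):
    """Clean each sentence, then drop empties and exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for sentence in sentences:
        candidate = clean_sentence(sentence)
        if candidate and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


if __name__ == "__main__":
    # Illustrative input only; these are not sentences from the dataset.
    sample = ["আমি ভাত খাই। OK 123!", "আমি ভাত খাই", "hello!"]
    print(preprocess(sample))
```

In this sketch, deduplication runs after cleaning, so sentences that differ only in punctuation or stray Latin text collapse into one entry. The description lists duplicate removal first, so the original pipeline may have applied the steps in a different order.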
The ChattoBan dataset can support work in several NLP and AI areas, including:
Language identification for closely related languages
Machine translation and code-switching analysis
Supervised and semi-supervised learning
Sociolinguistic and dialect studies
Bangla-centric NLP research and educational applications
The ChattoBan dataset is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community. By providing a reliable benchmark for Bengali–Chittagonian identification, this dataset aims to support future advancements in low-resource language processing.
Created:
2025-11-19



