Chattogram sent: A Multilingual Sentiment Dataset for Chattogram, Bengali, and English (Version 2)
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/crznfsztp9
Description:
The Chattogram dialect (Chittangga), widely spoken in southeastern Bangladesh, is primarily an oral language with no standardized writing system. Despite its large speaker population, the dialect remains underrepresented in computational linguistics due to the scarcity of high-quality, manually curated digital resources. This dataset is a manually built multilingual sentiment corpus, curated entirely by researchers who are native speakers of the Chattogram dialect.
It consists of 4,451 parallel sentences aligned across five distinct columns: Standard Bangla, Chattogram dialect, English, Sentiment labels, and the Source of Data. The inclusion of the 'Source of Data' column provides essential context by categorizing each entry based on its origin, such as social media posts, regional drama scripts, and everyday conversations.
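The five-column layout described above lends itself to a quick structural check before any modelling. The following is a minimal sketch, not the authors' tooling: the exact header strings in `EXPECTED_COLUMNS` are assumptions (the description names the columns but not their spelling in the file), the CSV format itself is assumed, and the sample rows are placeholders rather than real corpus entries.

```python
import csv
import io

# Assumed header names; adjust to match the actual file.
EXPECTED_COLUMNS = [
    "Standard Bangla",
    "Chattogram",
    "English",
    "Sentiment",
    "Source of Data",
]
# Label names as given in the dataset description.
ALLOWED_LABELS = {"Neutral", "Negative", "Positive"}


def validate_rows(handle):
    """Return (row_count, problems) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return 0, [f"missing columns: {missing}"]

    problems = []
    count = 0
    # Row numbers are approximate if a field contains embedded newlines.
    for row_no, row in enumerate(reader, start=2):  # header is row 1
        count += 1
        for col in EXPECTED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col!r}")
        label = (row["Sentiment"] or "").strip()
        if label and label not in ALLOWED_LABELS:
            problems.append(f"row {row_no}: unknown label {label!r}")
    return count, problems


# Placeholder rows for illustration only; not taken from the corpus.
sample = io.StringIO(
    "Standard Bangla,Chattogram,English,Sentiment,Source of Data\n"
    "<bangla 1>,<chattogram 1>,<english 1>,Positive,social media\n"
    "<bangla 2>,,<english 2>,Happy,drama script\n"
)
count, problems = validate_rows(sample)
for problem in problems:
    print(problem)
```

On the real file, the same function could be called with `open(path, newline="", encoding="utf-8")`, and a clean run would be expected to report 4,451 rows with no problems.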
The Chattogram dialect is predominantly spoken in Chattogram city, Cox’s Bazar, and the coastal regions of the Chittagong Hill Tracts, as well as nearby districts of southeastern Bangladesh. Given the oral nature of the dialect, all Chattogram sentences were phonetically transcribed into Bengali script. The dataset follows a translation-first pipeline: each Chattogram sentence was translated into Standard Bangla and then English by the same native speakers to maintain semantic fidelity and cross-lingual alignment.
Sentiment annotation was performed after multilingual alignment, with each sentence categorized as Neutral, Negative, or Positive (Neutral: 1,969; Negative: 1,467; Positive: 1,015). The dataset represents the first high-quality benchmark for sentiment analysis in the Chattogram dialect, enabling researchers to develop low-resource NLP models, dialectal sentiment classifiers, and cross-lingual transformer-based systems. Its native-driven design ensures linguistic authenticity, cultural accuracy, and contextual relevance, providing a valuable resource for the computational study of underrepresented languages.
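The three class counts above sum to the stated 4,451 sentences and are moderately imbalanced, which matters when training a classifier. As a rough sketch, the snippet below derives each class's share and a set of inverse-frequency weights from those counts. The weighting formula, total / (number of classes × class count), is a common heuristic assumed here for illustration; the dataset description does not prescribe any weighting scheme.

```python
# Class counts as reported in the dataset description.
COUNTS = {"Neutral": 1969, "Negative": 1467, "Positive": 1015}


def class_shares(counts):
    """Fraction of the corpus that falls in each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count).

    This is an assumed heuristic, not part of the dataset itself.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


shares = class_shares(COUNTS)
weights = balanced_weights(COUNTS)

print("total sentences:", sum(COUNTS.values()))
for label in COUNTS:
    print(f"{label}: share={shares[label]:.2%}, weight={weights[label]:.3f}")
```

Under this heuristic the smallest class (Positive) receives the largest weight and the largest class (Neutral) the smallest, so errors on rarer labels count for more during training.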
By combining manual transcription, expert multilingual translation, source-based categorization, and careful sentiment annotation, this corpus supports both academic research and practical applications in natural language processing, multilingual AI systems, and digital preservation of oral language traditions.
Created:
2026-01-19



