BOISHOMMO: A Standardized Multi-Label Bangla Hate Speech Dataset for Imbalance Analysis
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/4tsb6tg9b2
Description:
BOISHOMMO is a multi-label annotated dataset for hate speech analysis in Bangla, a morphologically rich and low-resource language. It addresses a significant gap in Natural Language Processing by providing a resource designed for multi-label classification in a non-Latin-script language. The dataset also includes an English translation of each Bangla comment, which supports cross-lingual research and makes the data accessible to international researchers working in multilingual NLP and comparative linguistic studies.
The dataset consists of 2,499 Bangla social media comments collected from public Facebook news pages such as Prothom Alo, Jugantor, and Kaler Kantho. Each comment was carefully and manually annotated by three native Bangla speakers, following strict guidelines to ensure consistency and accuracy. Labels were assigned across 10 overlapping hate categories: Race, Behavior, Physical, Class, Religion, Disability, Ethnicity, Gender, Sexual Orientation, and Political. The final annotation for each comment was determined by a majority voting process, and inter-annotator agreement was measured using Cohen’s Kappa to validate annotation quality.
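The majority-voting and agreement steps described above can be sketched as follows. This is a minimal illustration, not the dataset authors' procedure: it assumes each annotator's labels are stored as a binary vector over the 10 categories, and the function names `majority_vote` and `cohen_kappa` are hypothetical. Because Cohen's Kappa is defined for two raters, the sketch computes it for a single pair of annotators; how the three annotators were paired or averaged in the original work is not stated here.

```python
# Category order is taken from the description; the binary-vector
# encoding is an assumption made for this sketch.
CATEGORIES = [
    "Race", "Behavior", "Physical", "Class", "Religion",
    "Disability", "Ethnicity", "Gender", "Sexual Orientation", "Political",
]


def majority_vote(annotations):
    """Return the per-category majority label for one comment.

    `annotations` holds one binary vector per annotator. A category is
    kept when strictly more than half of the annotators marked it, so
    an odd number of annotators (such as three) never produces a tie.
    """
    n_annotators = len(annotations)
    return [1 if 2 * sum(votes) > n_annotators else 0
            for votes in zip(*annotations)]


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' binary labels on the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's rate of positive labels.
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example with three categories and three annotators (made-up values).
final_labels = majority_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
```

For a per-category view of annotation quality, `cohen_kappa` can be applied to one category's column across all comments for each pair of annotators.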
Besides its multi-aspect annotation structure and linguistic importance, BOISHOMMO emphasizes imbalance analysis. The dataset shows natural label imbalance across hate categories, reflecting real-world distributions and the challenges in hate speech detection. This feature makes it a useful benchmark for testing model robustness, creating effective multi-label classifiers, and exploring techniques like data augmentation and resampling. BOISHOMMO supports the future development of machine learning models and linguistic tools for Bangla and other under-resourced languages, helping promote inclusive and fair NLP research.
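As a starting point for the imbalance analysis mentioned above, the sketch below computes three common multi-label statistics: per-label counts, a per-label imbalance ratio (the count of the most frequent label divided by each label's count), and label cardinality (the mean number of labels per comment). These measures are standard in the multi-label literature but are not named in the dataset description, and the label matrix shown is made-up toy data, not values from BOISHOMMO.

```python
def label_counts(label_matrix):
    """Count how many comments carry each label (column sums)."""
    return [sum(column) for column in zip(*label_matrix)]


def imbalance_ratios(counts):
    """Per-label imbalance ratio: most frequent count / this label's count.

    The most frequent label gets 1.0; rarer labels get larger values.
    A label that never occurs is reported as infinity.
    """
    top = max(counts)
    return [top / c if c else float("inf") for c in counts]


def label_cardinality(label_matrix):
    """Mean number of active labels per comment."""
    return sum(sum(row) for row in label_matrix) / len(label_matrix)


# Toy label matrix: 4 comments x 3 labels (illustrative only).
toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
toy_counts = label_counts(toy_labels)
toy_ratios = imbalance_ratios(toy_counts)
```

Large ratios flag the categories where resampling or augmentation is most likely to matter when training a classifier.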
Created:
2025-08-18



