BADD: A Large-Scale Dataset for Arrogance Detection in the Bengali Language
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/fyzy2z8nzx
Description:
This dataset contains 46,128 labeled Bengali comments curated for the task of arrogance detection. While existing datasets focus heavily on hate speech or cyberbullying, this dataset addresses the subtle linguistic nuances of "arrogance", characterized by overbearing pride, lack of empathy, and social superiority, which is often expressed without overt toxicity.
The data was compiled to support research in Bengali NLP. It serves as the primary resource for training the high-performing BanglaBERT model (96% accuracy) described in the accompanying research paper.
Dataset Structure
The dataset is provided in a single .csv file with the following columns:
comment: The raw Bengali text.
source: The origin of the comment (online or AI).
weak_label: Initial label assigned by heuristic functions.
snorkel_label: Refined label produced by the Snorkel framework.
final_label: The target label for classification.
1: Arrogant
0: Non-arrogant
In addition, an automatically translated English version of the dataset is attached as test_translated_data.csv.
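The column layout described above can be checked with a short script before any training is attempted. The following is a minimal sketch using only the Python standard library. It assumes the labels are stored as the plain strings "1" and "0", which the description implies but does not confirm. The helper name summarize_labels and the sample rows are hypothetical, for illustration only.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["comment", "source", "weak_label", "snorkel_label", "final_label"]

# Assumption: labels are stored as the strings "1" and "0".
LABEL_NAMES = {"1": "Arrogant", "0": "Non-arrogant"}


def summarize_labels(csv_file):
    """Check the header of an open CSV file and count final_label values."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter()
    for row in reader:
        label = row["final_label"].strip()
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected final_label: {label!r}")
        counts[LABEL_NAMES[label]] += 1
    return counts


# Tiny in-memory stand-in for the real file; these rows are invented placeholders.
sample = io.StringIO(
    "comment,source,weak_label,snorkel_label,final_label\n"
    "example one,online,1,1,1\n"
    "example two,AI,0,0,0\n"
    "example three,online,0,1,1\n"
)
print(summarize_labels(sample))
```

To run this against the downloaded file, open it with encoding="utf-8" and newline="" so that Bengali text and quoted line breaks are read correctly, then pass the file object to summarize_labels. The main file's name is not given in the description, so substitute the name of the file you downloaded.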
Created:
2026-03-12



