BADD: A Large-Scale Dataset for Arrogance Detection in the Bengali Language

Source: NIAID Data Ecosystem. Indexed 2026-05-10.

Download link:

https://data.mendeley.com/datasets/fyzy2z8nzx

Resource description:

This dataset contains 46,128 labeled Bengali comments curated for the task of arrogance detection. While existing datasets focus heavily on hate speech or cyberbullying, this dataset addresses the subtle linguistic nuances of "arrogance", characterized by overbearing pride, lack of empathy, and social superiority, which is often expressed without overt toxicity. The data was compiled to support research in Bengali NLP. It serves as the primary resource for training the high-performing BanglaBERT model (96% accuracy) described in the accompanying research paper.

Dataset Structure

The dataset is provided in a single .csv file with the following columns:

- comment: The raw Bengali text.
- source: The origin of the comment (online or AI).
- weak_label: Initial label assigned by heuristic functions.
- snorkel_label: Refined label produced by the Snorkel framework.
- final_label: The target label for classification (1: Arrogant, 0: Non-arrogant).

Note: An automatically translated English dataset is also attached as test_translated_data.csv.
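As a rough illustration of how the columns listed above might be read, the sketch below counts final_label values and measures how often snorkel_label agrees with final_label. It uses only the Python standard library. The function name summarize_labels, the placeholder rows, and the assumption that labels are stored as the strings "0" and "1" are illustrative and not taken from the dataset itself; the actual encoding in the file may differ.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["comment", "source", "weak_label", "snorkel_label", "final_label"]


def summarize_labels(csv_file):
    """Count final_label values and snorkel/final agreement from an open CSV file.

    Hypothetical helper: assumes each label cell holds a plain string such as "0" or "1".
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    label_counts = Counter()
    agree = 0
    total = 0
    for row in reader:
        final = row["final_label"].strip()
        label_counts[final] += 1
        # True counts as 1, so this tallies rows where the two labels match.
        agree += row["snorkel_label"].strip() == final
        total += 1
    agreement = agree / total if total else 0.0
    return label_counts, agreement


# Tiny in-memory stand-in; these rows are placeholders, not real dataset entries.
sample = io.StringIO(
    "comment,source,weak_label,snorkel_label,final_label\n"
    "placeholder one,online,1,1,1\n"
    "placeholder two,AI,0,1,0\n"
    "placeholder three,online,0,0,0\n"
)
counts, agreement = summarize_labels(sample)
print(dict(counts), round(agreement, 2))
```

To run this against the downloaded file, one would pass a handle opened with open(path, encoding="utf-8", newline="") in place of the in-memory sample, where path points to wherever the .csv was saved. The UTF-8 encoding is an assumption that seems reasonable for Bengali text but is not stated in the description.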

Created:

2026-03-12
