Bn-HIB: A Benchmark Bengali Multimodal Dataset for Detecting Hate Speech and Inflammatory Content in Memes

Source: DataCite Commons. Updated: 2026-04-15. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/9vg79v65nr
Description:
Dataset Overview: The Bn-HIB (Bangla Hate–Inflammatory–Benign) dataset is a novel multimodal resource developed for detecting harmful content in Bengali memes. It contains 3,247 manually annotated memes and is the first dataset to explicitly differentiate inflammatory content from direct hate speech in the Bengali language.

Data Splits: The dataset is divided into three standard subsets:

- Training set (70%): 2,272 instances
- Validation set (15%): 487 instances
- Test set (15%): 488 instances

| Class             | Training | Validation | Testing | Total |
| ----------------- | -------- | ---------- | ------- | ----- |
| Hate (HM)         | 811      | 174        | 173     | 1,158 |
| Inflammatory (IM) | 773      | 166        | 167     | 1,106 |
| Benign (BM)       | 688      | 147        | 148     | 983   |
| Total             | 2,272    | 487        | 488     | 3,247 |

Key Characteristics

- Multimodal Content: Each instance consists of both image and embedded text.
- Language Variety: Includes standard Bengali, Bengali-English code-mixed, and code-switched memes.
- Annotation Process: Annotated by three fluent Bengali speakers. A structured decision tree was used to ensure consistency. Achieved a Fleiss’ kappa score of 0.79, indicating substantial agreement.
- Data Source: Collected from 25 public Facebook groups and pages with high meme activity.
- Text Extraction: Text within images was extracted using the Gemini API and manually verified for script accuracy.

Significance

This dataset provides a valuable benchmark for research in multimodal hate speech detection, particularly for low-resource languages like Bengali. Its distinction between hate and inflammatory content enables more nuanced modelling and analysis of harmful online behaviour.
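The split table above can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the counts are copied from the table, while the names `COUNTS`, `split_totals`, `class_totals`, and `shares` are hypothetical helpers introduced here for illustration.

```python
# Counts copied from the split table in the description.
# All names below are illustrative; the dataset does not ship this code.
COUNTS = {
    "Hate (HM)": {"train": 811, "val": 174, "test": 173},
    "Inflammatory (IM)": {"train": 773, "val": 166, "test": 167},
    "Benign (BM)": {"train": 688, "val": 147, "test": 148},
}


def split_totals(counts):
    """Sum each split (train/val/test) across all classes."""
    totals = {}
    for per_split in counts.values():
        for split, n in per_split.items():
            totals[split] = totals.get(split, 0) + n
    return totals


def class_totals(counts):
    """Sum each class across all splits."""
    return {label: sum(per_split.values()) for label, per_split in counts.items()}


def shares(totals):
    """Convert absolute counts into fractions of the grand total."""
    grand = sum(totals.values())
    return {key: value / grand for key, value in totals.items()}


if __name__ == "__main__":
    by_split = split_totals(COUNTS)
    by_class = class_totals(COUNTS)
    print("Instances per split:", by_split)
    print("Instances per class:", by_class)
    for name, frac in shares(by_split).items():
        print(f"{name}: {frac:.1%} of the dataset")
    for name, frac in shares(by_class).items():
        print(f"{name}: {frac:.1%} of the dataset")
```

The per-split sums reproduce the stated 2,272 / 487 / 488 instances, and the per-class shares show that the three labels are roughly balanced, with Hate the largest class and Benign the smallest.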
Provider:
Mendeley Data
Created:
2026-04-15