Bn-HIB: A Benchmark Bengali Multimodal Dataset for Detecting Hate Speech and Inflammatory Content in Memes
DataCite Commons. Updated 2026-04-15; indexed 2026-05-04.
Download link:
https://data.mendeley.com/datasets/9vg79v65nr/1
Resource description:
Dataset Overview:
The Bn-HIB (Bangla Hate–Inflammatory–Benign) dataset is a novel multimodal resource developed for detecting harmful content in Bengali memes. It contains 3,247 manually annotated memes and is the first dataset to explicitly differentiate inflammatory content from direct hate speech in the Bengali language.
Data Splits:
The dataset is divided into three standard subsets:
Training set (70%): 2,272 instances
Validation set (15%): 487 instances
Test set (15%): 488 instances
| Class | Training | Validation | Testing | Total |
| ----------------- | --------- | ---------- | ------- | --------- |
| Hate (HM) | 811 | 174 | 173 | 1,158 |
| Inflammatory (IM) | 773 | 166 | 167 | 1,106 |
| Benign (BM) | 688 | 147 | 148 | 983 |
| Total | 2,272 | 487 | 488 | 3,247 |
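As a quick consistency check on the table above, the following minimal Python sketch recomputes the split totals, class totals, and their percentage shares. The counts are copied from the table; the names `COUNTS`, `split_totals`, `class_totals`, and `shares` are illustrative assumptions and not part of the dataset release.

```python
# Counts copied from the class-by-split table above.
# All names in this sketch are hypothetical helpers, not dataset tooling.
COUNTS = {
    "Hate (HM)": {"train": 811, "val": 174, "test": 173},
    "Inflammatory (IM)": {"train": 773, "val": 166, "test": 167},
    "Benign (BM)": {"train": 688, "val": 147, "test": 148},
}


def split_totals(counts):
    """Sum instances per split across all classes."""
    totals = {}
    for per_split in counts.values():
        for split, n in per_split.items():
            totals[split] = totals.get(split, 0) + n
    return totals


def class_totals(counts):
    """Sum instances per class across all splits."""
    return {label: sum(per_split.values()) for label, per_split in counts.items()}


def shares(totals):
    """Convert a mapping of counts into percentage shares, rounded to 0.1."""
    grand_total = sum(totals.values())
    return {key: round(100 * n / grand_total, 1) for key, n in totals.items()}


if __name__ == "__main__":
    by_split = split_totals(COUNTS)
    by_class = class_totals(COUNTS)
    print("Per split:", by_split, shares(by_split))
    print("Per class:", by_class, shares(by_class))
    print("Grand total:", sum(by_split.values()))
```

The recomputed totals match the table, and the split shares land on the stated 70/15/15 proportions after rounding. The per-class counts are roughly proportional across splits, which is consistent with a stratified split, although the description does not state how the split was drawn.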
Key Characteristics:
Multimodal Content: Each instance consists of both image and embedded text.
Language Variety: Includes standard Bengali, Bengali-English code-mixed, and code-switched memes.
Annotation Process:
Annotated by three fluent Bengali speakers.
A structured decision tree was used to ensure consistency.
Achieved a Fleiss’ kappa score of 0.79, indicating substantial agreement.
Data Source: Collected from 25 public Facebook groups and pages with high meme activity.
Text Extraction: Text within images was extracted using the Gemini API and manually verified for script accuracy.
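For readers unfamiliar with the agreement statistic reported under Annotation Process, here is a minimal sketch of the standard Fleiss' kappa formula in plain Python. The description does not say which tool the authors used, so this is not their procedure. The function name `fleiss_kappa` and the toy ratings are assumptions made for illustration; the toy ratings are invented and are not drawn from Bn-HIB.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    `ratings` is a list of rows, one per item. Each row holds, per category,
    the number of annotators who assigned that category to the item.
    Every row must sum to the same number of annotators (at least 2).
    """
    if not ratings:
        raise ValueError("ratings must contain at least one item")
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two annotators are required")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of annotators")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    total_assignments = n_items * n_raters
    p_cat = [
        sum(row[j] for row in ratings) / total_assignments
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]

    observed = sum(p_item) / n_items          # mean observed agreement
    expected = sum(p * p for p in p_cat)      # agreement expected by chance
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented toy data: 6 memes, 3 annotators, columns = (HM, IM, BM).
    toy = [
        [3, 0, 0],
        [2, 1, 0],
        [0, 3, 0],
        [0, 1, 2],
        [0, 0, 3],
        [1, 1, 1],
    ]
    print(round(fleiss_kappa(toy), 3))
```

On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are described as substantial agreement, which is the band the reported 0.79 falls into.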
Significance:
This dataset provides a valuable benchmark for research in multimodal hate speech detection, particularly for low-resource languages like Bengali. Its distinction between hate and inflammatory content enables more nuanced modelling and analysis of harmful online behaviour.
Provider:
Mendeley Data
Created:
2026-04-15



