A Multimodal Bangla Meme Dataset for Hate Speech, Sentiment, and Sarcasm Detection with Text–Image Fusion and Lexicon Annotations
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/d6t8nbkj96
Description:
Bangla Multimodal Meme Dataset for Hate Speech, Sarcasm, and Offensive Content Detection
This dataset consists of 5,126 Bangla memes annotated for multiple offensive and contextual attributes including hate speech, sarcasm, vulgarity, violence, humor, and category. The dataset is intended to support multimodal NLP research by combining OCR-extracted Bangla text, image metadata, perceptual image fingerprints (pHash), and lexicon-based linguistic features.
Due to copyright restrictions, the original meme images are not distributed. Instead, the dataset provides:
OCR-extracted Bangla text from each meme
English translations
Perceptual hash (pHash) serving as an image fingerprint
Image metadata (width and height)
Manual annotations for hate speech, sarcasm, vulgarity, violence, humor, and category
A curated Bangla offensive lexicon for auxiliary feature extraction
Researchers can locate the original memes by searching the web for the OCR text, then confirm a candidate image by comparing its perceptual hash against the provided pHash value. This supports reproducibility while respecting the copyright restrictions on redistributing the images.
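The pHash check can be sketched as a Hamming-distance comparison. This is a minimal illustration, not the authors' procedure: it assumes the pHash values are stored as hex strings of equal length, and the function names and the `max_distance` threshold of 4 bits are hypothetical choices. Computing the hash of a retrieved image would additionally need an image-hashing library configured to match whatever pHash variant the dataset used, which the description does not specify.

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count the differing bits between two hex-encoded hashes of equal length."""
    if len(phash_a) != len(phash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")


def is_probable_match(phash_a: str, phash_b: str, max_distance: int = 4) -> bool:
    """Treat two images as the same meme if their hashes differ in few bits.

    max_distance is an assumed tolerance, not a value given by the dataset.
    """
    return hamming_distance(phash_a, phash_b) <= max_distance
```

A distance of 0 means the hashes are identical. A small nonzero distance usually indicates the same image after resizing or recompression, so a tolerance is more forgiving than strict equality when the retrieved copy is not byte-identical to the one the annotators saw.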
The dataset was annotated by three independent annotators following a shared guideline. Annotation reliability was assessed on a stratified subset of 400 memes using Fleiss’ kappa, demonstrating substantial to near-perfect agreement across labels.
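For readers who want to reproduce an agreement check of this kind, Fleiss' kappa for a single label can be computed as below. This is a sketch under assumptions: the description does not give the per-label kappa values or the exact computation, and the input layout (one row per meme, one entry per annotator) and the function name are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for one label.

    ratings: list of rows, one row per item, each row holding the category
    chosen by every annotator (same number of annotators per item).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Per-item agreement: share of annotator pairs that agree.
        pairs_agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs_agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)
```

On the widely used Landis and Koch scale, "substantial" corresponds to kappa between 0.61 and 0.80 and "almost perfect" to values above 0.80, although the description does not state which scale the authors applied.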
Additionally, the dataset includes a labeled Bangla offensive lexicon containing 441 terms categorized into vulgar, insult, violent, and hate-associated words. These lexicon features provide complementary linguistic signals for multimodal fusion experiments.
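One simple way to turn such a lexicon into features is to count, for each meme, how many OCR tokens fall into each category. The sketch below is illustrative only: the term-to-category dictionary layout, the category label strings, the whitespace tokenization, and the placeholder terms are all assumptions, since the description does not specify the lexicon file format or how the features were derived.

```python
import string
from collections import Counter

# Assumed label strings for the four lexicon categories named in the description.
CATEGORIES = ("vulgar", "insult", "violent", "hate")

# Punctuation to trim from token edges, including the Bangla danda.
_EDGE_CHARS = string.punctuation + "।"


def lexicon_features(text, lexicon):
    """Return per-category hit counts, ordered as in CATEGORIES.

    lexicon: dict mapping a term to one of the CATEGORIES labels.
    """
    counts = Counter()
    for token in text.split():
        category = lexicon.get(token.strip(_EDGE_CHARS))
        if category is not None:
            counts[category] += 1
    return [counts[c] for c in CATEGORIES]


# Placeholder terms stand in for real lexicon entries.
demo_lexicon = {"term_a": "vulgar", "term_b": "insult", "term_c": "violent"}
features = lexicon_features("term_a, some text term_c! term_a", demo_lexicon)
```

The resulting fixed-length vector can be concatenated with text and image embeddings in a fusion model. Whole-token matching will miss inflected or misspelled forms, which OCR output often contains, so a real pipeline would likely need normalization before lookup.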
This dataset is suitable for research in:
Hate speech detection in Bangla memes
Sarcasm and humor analysis
Offensive language detection
Multimodal text–image fusion models
Low-resource Bangla NLP research
The dataset is released for research and academic use only.
Created:
2026-02-02



