QCRI/Arabic-Hateful-Memes
Hugging Face · updated 2026-04-20 · indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/QCRI/Arabic-Hateful-Memes
Description:
---
language:
- ar
license: cc-by-nc-4.0
task_categories:
- image-classification
- text-classification
tags:
- hate-speech
- memes
- arabic
- multimodal
- multi-label
pretty_name: Arabic Hateful Memes (ArHateMeme)
size_categories:
- n<1K
configs:
- config_name: sample_100
  data_files:
  - split: train
    path: sample_100/train-*
  default: true
dataset_info:
  config_name: sample_100
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: fine_grained_label
    sequence: string
  splits:
  - name: train
    num_examples: 100
---
# Arabic Hateful Memes (ArHateMeme) — Public Sample
This repository hosts a **100-example diversity-sampled preview** drawn from the
**training split** of the **ArHateMeme** dataset: 5,000 Arabic memes manually
annotated for hatefulness and fine-grained sub-types. The full dataset will be
released alongside the associated shared task.
> ⚠️ This preview is intended for format inspection, tooling validation, and
> schema alignment only. It is **not** a benchmark and should not be used for
> model evaluation.
---
## About the full dataset
**ArHateMeme** is a multimodal (image + Arabic text) meme dataset annotated for
hate speech in Arabic. It contains **5,000 memes** with a binary hatefulness
label and a **multi-label** set of fine-grained sub-types.
### Annotation
- 500 memes are triple-annotated (calibration / gold test set).
- 4,500 memes are single-annotated by trained annotators.
- Binary labels use majority voting on the triple-annotated subset.
- Fine-grained sub-types are the union of sub-types from annotators whose
binary label matches the majority label.
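
The aggregation rule above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a list of `(binary_label, sub_types)` pairs, one per annotator) and the function name `aggregate_annotations` are assumptions made for the example.

```python
from collections import Counter


def aggregate_annotations(annotations):
    """Aggregate per-annotator labels for one triple-annotated meme.

    `annotations` is a list of (binary_label, sub_types) pairs, one per
    annotator. This layout is hypothetical, chosen for illustration.
    """
    # Majority vote over the binary labels. With three annotators and two
    # classes, a strict majority always exists.
    votes = Counter(binary for binary, _ in annotations)
    majority_label, _ = votes.most_common(1)[0]

    # Union of sub-types, restricted to annotators who agree with the majority.
    sub_types = set()
    for binary, subs in annotations:
        if binary == majority_label:
            sub_types.update(subs)

    return majority_label, sorted(sub_types)


example = [
    ("Hateful", ["Mocking"]),
    ("Hateful", ["Mocking", "Slurs"]),
    ("Not Hateful", ["Humor"]),
]
print(aggregate_annotations(example))  # ('Hateful', ['Mocking', 'Slurs'])
```

The dissenting annotator's `Humor` tag is dropped because that annotator's binary label does not match the majority.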
### Label Taxonomy
| Aspect | Values |
|---|---|
| Binary | `Hateful`, `Not Hateful` |
| Hateful sub-types | Mocking, Incitement, Dehumanization, Slurs, Contempt, Inferiority, Exclusion, Stereotyping, Extremism, Threat, Insults, Historical, Other |
| Non-hateful sub-types | Humor, Sarcasm, Other |

A meme is never assigned both hateful and non-hateful sub-types simultaneously.
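
A record can be checked against this taxonomy with a small validator. The sketch below assumes that a record's sub-types must all come from the group matching its binary label; since `Other` appears in both groups, it is accepted under either label. The names `ALLOWED` and `is_consistent` are illustrative.

```python
HATEFUL_SUBTYPES = {
    "Mocking", "Incitement", "Dehumanization", "Slurs", "Contempt",
    "Inferiority", "Exclusion", "Stereotyping", "Extremism", "Threat",
    "Insults", "Historical", "Other",
}
NON_HATEFUL_SUBTYPES = {"Humor", "Sarcasm", "Other"}

# Assumption: sub-types are drawn only from the group of the binary label.
ALLOWED = {"Hateful": HATEFUL_SUBTYPES, "Not Hateful": NON_HATEFUL_SUBTYPES}


def is_consistent(label, fine_grained):
    """Return True if every sub-type is valid for the given binary label."""
    allowed = ALLOWED.get(label)
    if allowed is None:  # unknown binary label
        return False
    return set(fine_grained) <= allowed


print(is_consistent("Hateful", ["Mocking", "Other"]))    # True
print(is_consistent("Not Hateful", ["Humor", "Slurs"]))  # False
```

Running such a check over a loaded split is a cheap way to catch schema drift before training.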
### Official splits (full dataset)
| Split | Records | % | Hateful | Not Hateful |
|---|---|---|---|---|
| train | 3,500 | 70% | 1,324 | 2,176 |
| dev | 500 | 10% | 189 | 311 |
| test | 1,000 | 20% | 337 | 663 |
| **Total** | **5,000** | 100% | **1,850** | **3,150** |

All 500 triple-annotated gold memes are in the **test** split. Splits are
stratified by binary label (seed 42) and there is no meme overlap between
splits.
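
As a quick arithmetic check, the class balance per split can be recomputed from the counts in the table. The `SPLITS` dictionary and the `hateful_share` helper are illustrative names, not part of any dataset tooling. By these counts the test split has a lower hateful share than train and dev; the card does not say why.

```python
# Counts copied from the official splits table.
SPLITS = {
    "train": {"Hateful": 1324, "Not Hateful": 2176},
    "dev": {"Hateful": 189, "Not Hateful": 311},
    "test": {"Hateful": 337, "Not Hateful": 663},
}


def hateful_share(counts):
    """Fraction of records in a split that are labeled Hateful."""
    return counts["Hateful"] / sum(counts.values())


for name, counts in SPLITS.items():
    total = sum(counts.values())
    print(f"{name}: {total} records, {hateful_share(counts):.1%} hateful")
# train: 3500 records, 37.8% hateful
# dev: 500 records, 37.8% hateful
# test: 1000 records, 33.7% hateful
```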

---
## About this preview sample
- **Source split:** `train` (single-annotated bulk memes)
- **Size:** 100 memes
- **Sampling:** stratified to cover **every fine-grained sub-type present in
the training data** and preserve a realistic hateful / non-hateful ratio.
- **Images:** embedded as bytes via the `datasets.Image` feature, so no external
files are required.
- **Arrow/Parquet:** stored as a Hugging Face `Dataset` (Arrow) and uploaded as
Parquet shards so the Hub viewer renders images inline.
### Sample distribution
| Binary | Count |
|---|---|
| Not Hateful | 60 |
| Hateful | 40 |

| Fine-grained sub-type | Count |
|---|---|
| Sarcasm | 27 |
| Humor | 23 |
| Mocking | 19 |
| Incitement | 15 |
| Other | 10 |
| Contempt | 8 |
| Slurs | 8 |
| Dehumanization | 8 |
| Exclusion | 5 |
| Inferiority | 5 |

(Fine-grained counts sum to more than 100 because the label is multi-label.)
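
The counts in the table sum to 128 sub-type assignments across 100 memes. A tally of this kind can be sketched with `collections.Counter`; the helper name `subtype_counts` and the toy records are made up for illustration and are not real dataset rows. Any iterable of dict records with a `fine_grained_label` list, such as the loaded preview, should work in place of `toy_records`.

```python
from collections import Counter


def subtype_counts(records):
    """Count how often each fine-grained sub-type occurs across records."""
    counts = Counter()
    for record in records:
        # Each record contributes one count per sub-type it carries,
        # so the total can exceed the number of records.
        counts.update(record["fine_grained_label"])
    return counts


# Hand-made records in the card's schema (illustrative only).
toy_records = [
    {"label": "Hateful", "fine_grained_label": ["Mocking", "Incitement"]},
    {"label": "Not Hateful", "fine_grained_label": ["Humor"]},
    {"label": "Hateful", "fine_grained_label": ["Mocking"]},
]

counts = subtype_counts(toy_records)
print(counts["Mocking"], sum(counts.values()), len(toy_records))  # 2 4 3
```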

---
## Record schema
```python
{
    "id": "102396787_870863910087838_...jpg",         # string, unique meme id
    "image": <PIL.Image>,                             # embedded bytes, decoded on load
    "text": "…",                                      # OCR-extracted meme text (Arabic)
    "label": "Hateful" | "Not Hateful",               # binary label
    "fine_grained_label": ["Mocking", "Incitement"],  # multi-label sub-types
}
```
## Usage
```python
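from datasets import load_dataset

# "sample_100" is the default config, so no config name is needed.
ds = load_dataset("QCRI/Arabic-Hateful-Memes", split="train")
print(ds)  # shows the features and the row count

example = ds[0]
example["image"].show()  # decoded PIL image; opens the system image viewer
print(example["text"], example["label"], example["fine_grained_label"])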
```
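
For fine-grained classification, the list-valued `fine_grained_label` usually needs to be turned into a fixed-length target. The sketch below shows one possible multi-hot encoding. The column order in `SUBTYPES` and the choice to share a single `Other` column between the hateful and non-hateful groups are assumptions of this example, not something the dataset prescribes. If it suits your setup, something like `ds.map(lambda ex: {"targets": to_multi_hot(ex["fine_grained_label"])})` could attach the vectors to the loaded preview.

```python
# Sub-type vocabulary from the label taxonomy. "Other" appears in both the
# hateful and non-hateful groups and is merged into one column here.
SUBTYPES = [
    "Mocking", "Incitement", "Dehumanization", "Slurs", "Contempt",
    "Inferiority", "Exclusion", "Stereotyping", "Extremism", "Threat",
    "Insults", "Historical", "Humor", "Sarcasm", "Other",
]
INDEX = {name: i for i, name in enumerate(SUBTYPES)}


def to_multi_hot(fine_grained):
    """Encode a list of sub-type names as a 0/1 vector over SUBTYPES."""
    vector = [0] * len(SUBTYPES)
    for name in fine_grained:
        vector[INDEX[name]] = 1  # raises KeyError on an unknown sub-type
    return vector


print(to_multi_hot(["Mocking", "Incitement"]))
# [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```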
---
## Intended use and limitations
- **Intended use:** research on Arabic multimodal hate speech detection,
including binary classification, fine-grained sub-type classification, and
evaluation of vision-language models.
- **Limitations:** memes reflect online discourse and contain offensive and
harmful content. The preview is not balanced and is too small for training
or evaluation. Most annotations (4,500 of 5,000 memes) come from a single
annotator and may contain noise.
- **Content warning:** this dataset contains text and imagery that is
offensive, discriminatory, or otherwise harmful by design. Handle with care.
## License
Released under **CC BY-NC 4.0** for research use only. Not to be used for
commercial purposes or for training systems that generate harmful content.
## Citation
A citation will be provided when the full dataset is released. Until then,
please cite this repository URL.
Provider:
QCRI



