
QCRI/Arabic-Hateful-Memes

Source: Hugging Face. Updated 2026-04-20; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/QCRI/Arabic-Hateful-Memes
Resource overview:
---
language:
- ar
license: cc-by-nc-4.0
task_categories:
- image-classification
- text-classification
tags:
- hate-speech
- memes
- arabic
- multimodal
- multi-label
pretty_name: Arabic Hateful Memes (ArHateMeme)
size_categories:
- n<1K
configs:
- config_name: sample_100
  data_files:
  - split: train
    path: sample_100/train-*
  default: true
dataset_info:
  config_name: sample_100
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: fine_grained_label
    sequence: string
  splits:
  - name: train
    num_examples: 100
---

# Arabic Hateful Memes (ArHateMeme): Public Sample

This repository hosts a **100-example diversity-sampled preview** drawn from the **training split** of the **ArHateMeme** dataset: 5,000 Arabic memes manually annotated for hatefulness and fine-grained sub-types. The full dataset will be released alongside the associated shared task.

> ⚠️ This preview is intended for format inspection, tooling validation, and
> schema alignment only. It is **not** a benchmark and should not be used for
> model evaluation.

---

## About the full dataset

**ArHateMeme** is a multimodal (image + Arabic text) meme dataset annotated for hate speech in Arabic. It contains **5,000 memes** with a binary hatefulness label and a **multi-label** set of fine-grained sub-types.

### Annotation

- 500 memes are triple-annotated (calibration / gold test set).
- 4,500 memes are single-annotated by trained annotators.
- Binary labels use majority voting on the triple-annotated subset.
- Fine-grained sub-types are the union of sub-types from annotators whose binary label matches the majority label.

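The aggregation rule for the triple-annotated subset can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the function name `aggregate_annotations` and the input shape (a list of `(binary_label, sub_types)` pairs, one per annotator) are hypothetical.

```python
from collections import Counter


def aggregate_annotations(annotations):
    """Combine per-annotator judgments for one meme.

    `annotations` is a hypothetical input shape: a list of
    (binary_label, sub_types) pairs, one pair per annotator.
    """
    # Majority vote on the binary label. With three annotators and two
    # possible labels there is always a strict majority.
    votes = Counter(label for label, _ in annotations)
    majority_label = votes.most_common(1)[0][0]

    # Union of sub-types, taken only from annotators who agree with the majority.
    sub_types = set()
    for label, subs in annotations:
        if label == majority_label:
            sub_types.update(subs)

    return majority_label, sorted(sub_types)


# Toy annotations (not real data): two annotators say Hateful, one disagrees.
example = [
    ("Hateful", ["Mocking"]),
    ("Hateful", ["Mocking", "Slurs"]),
    ("Not Hateful", ["Humor"]),
]
print(aggregate_annotations(example))
```

In this toy case the dissenting annotator's `Humor` tag is dropped, so the aggregated record carries only hateful sub-types.
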
### Label Taxonomy

| Aspect | Values |
|---|---|
| Binary | `Hateful`, `Not Hateful` |
| Hateful sub-types | Mocking, Incitement, Dehumanization, Slurs, Contempt, Inferiority, Exclusion, Stereotyping, Extremism, Threat, Insults, Historical, Other |
| Non-hateful sub-types | Humor, Sarcasm, Other |

A meme is never assigned both hateful and non-hateful sub-types simultaneously.

### Official splits (full dataset)

| Split | Records | % | Hateful | Not Hateful |
|---|---|---|---|---|
| train | 3,500 | 70% | 1,324 | 2,176 |
| dev | 500 | 10% | 189 | 311 |
| test | 1,000 | 20% | 337 | 663 |
| **Total** | **5,000** | 100% | **1,850** | **3,150** |

All 500 triple-annotated gold memes are in the **test** split. Splits are stratified by binary label (seed 42) and there is no meme overlap between splits.

---

## About this preview sample

- **Source split:** `train` (single-annotated bulk memes)
- **Size:** 100 memes
- **Sampling:** stratified to cover **every fine-grained sub-type present in the training data** and preserve a realistic hateful / non-hateful ratio.
- **Images:** embedded as bytes via the `datasets.Image` feature; no external files are required.
- **Arrow/Parquet:** stored as a Hugging Face `Dataset` (Arrow) and uploaded as parquet shards so the Hub viewer renders images inline.

### Sample distribution

| Binary | Count |
|---|---|
| Not Hateful | 60 |
| Hateful | 40 |

| Fine-grained sub-type | Count |
|---|---|
| Sarcasm | 27 |
| Humor | 23 |
| Mocking | 19 |
| Incitement | 15 |
| Other | 10 |
| Contempt | 8 |
| Slurs | 8 |
| Dehumanization | 8 |
| Exclusion | 5 |
| Inferiority | 5 |

(Fine-grained counts sum to more than 100 because the label is multi-label.)

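Because `fine_grained_label` is a list, sub-type counts tally label occurrences rather than memes, which is why the totals exceed the number of records. A minimal sketch, assuming rows shaped like plain dictionaries with a `fine_grained_label` list; the helper name `count_sub_types` and the toy records are hypothetical, while `table_counts` is copied from the sample distribution table above.

```python
from collections import Counter


def count_sub_types(records):
    """Tally each sub-type once per record that carries it."""
    counts = Counter()
    for record in records:
        # set() guards against a sub-type listed twice on the same record.
        counts.update(set(record["fine_grained_label"]))
    return counts


# Toy records (not real data) in the shape of the dataset rows.
toy_records = [
    {"label": "Hateful", "fine_grained_label": ["Mocking", "Incitement"]},
    {"label": "Not Hateful", "fine_grained_label": ["Humor"]},
    {"label": "Hateful", "fine_grained_label": ["Mocking"]},
]
toy_counts = count_sub_types(toy_records)
# Three records yield four label occurrences in total.
print(toy_counts["Mocking"], sum(toy_counts.values()))

# Counts copied from the sample distribution table.
table_counts = {
    "Sarcasm": 27, "Humor": 23, "Mocking": 19, "Incitement": 15, "Other": 10,
    "Contempt": 8, "Slurs": 8, "Dehumanization": 8, "Exclusion": 5, "Inferiority": 5,
}
# Total label occurrences across the 100 preview memes.
print(sum(table_counts.values()))
```

The table's counts add up to 128 label occurrences across 100 memes, so on average a preview meme carries a little more than one sub-type.
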

---

## Record schema

```python
{
    "id": "102396787_870863910087838_...jpg",         # string, unique meme id
    "image": <PIL.Image>,                             # embedded bytes, decoded on load
    "text": "…",                                      # OCR-extracted meme text (Arabic)
    "label": "Hateful" | "Not Hateful",               # binary label
    "fine_grained_label": ["Mocking", "Incitement"],  # multi-label sub-types
}
```

## Usage

```python
from datasets import load_dataset

# Load the preview's train split (100 examples).
ds = load_dataset("QCRI/Arabic-Hateful-Memes", split="train")
print(ds)

# Inspect the first record: the image is decoded to a PIL image on access.
example = ds[0]
example["image"].show()
print(example["text"], example["label"], example["fine_grained_label"])
```

---

## Intended use and limitations

- **Intended use:** research on Arabic multimodal hate speech detection, including binary classification, fine-grained sub-type classification, and evaluation of vision-language models.
- **Limitations:** memes reflect online discourse and contain offensive and harmful content. The preview is not balanced and is too small for training or evaluation. Annotations are partially single-annotator and may contain noise.
- **Content warning:** this dataset contains text and imagery that is offensive, discriminatory, or otherwise harmful by design. Handle with care.

## License

Released under **CC BY-NC 4.0** for research use only. Not to be used for commercial purposes or for training systems that generate harmful content.

## Citation

A citation will be provided when the full dataset is released. Until then, please cite this repository URL.
Provider:
QCRI