QCRI/ArMeme
Source: Hugging Face. Updated 2024-10-08; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/QCRI/ArMeme
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- image-classification
- text-classification
- visual-question-answering
language:
- ar
pretty_name: ArMeme
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: img_path
    dtype: string
  - name: class_label
    dtype:
      class_label:
        names:
          '0': not_propaganda
          '1': propaganda
          '2': not-meme
          '3': other
  splits:
  - name: train
    num_bytes: 288878900.171
    num_examples: 4007
  - name: dev
    num_bytes: 45908447.0
    num_examples: 584
  - name: test
    num_bytes: 81787436.176
    num_examples: 1134
  download_size: 423396230
  dataset_size: 416574783.347
---
# ArMeme Dataset
## Overview
ArMeme is a multimodal dataset of Arabic memes that pairs each meme image with its text, collected from various social media platforms. The authors present it as the first resource dedicated to Arabic multimodal research. The dataset is annotated for propaganda detection in memes, but it can also support other research, including sentiment analysis, hate speech detection, cultural studies, meme generation, and cross-lingual transfer learning. More broadly, it supports work on the intersection of language, culture, and visual communication.
## Dataset Structure
The dataset is divided into three splits:
- **Train**: The training set (4,007 examples)
- **Dev**: The development/validation set (584 examples)
- **Test**: The test set (1,134 examples)
Each entry in the dataset includes:
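- `id`: A unique identifier for the entry.
- `text`: The textual content associated with the image.
- `image`: The corresponding image.
- `img_path`: The file path to the image.
- `class_label`: The annotated label, one of `not_propaganda`, `propaganda`, `not-meme`, or `other`.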
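The card metadata also declares a `class_label` feature stored as an integer with four named classes. As a minimal sketch of how those integers can be turned back into names, the snippet below hard-codes the names in the order given in the metadata. The helper names `label_name` and `label_distribution` and the sample ids are hypothetical and used only for illustration.

```python
from collections import Counter

# Label names in the order declared in the dataset card metadata.
LABEL_NAMES = ["not_propaganda", "propaganda", "not-meme", "other"]


def label_name(label_id: int) -> str:
    """Map an integer class_label value to its name."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unknown class_label id: {label_id}")
    return LABEL_NAMES[label_id]


def label_distribution(label_ids):
    """Count how often each label name occurs in an iterable of ids."""
    return Counter(label_name(i) for i in label_ids)


# Hypothetical label ids for illustration only, not taken from the dataset.
example_ids = [0, 1, 0, 3, 2, 0]
print(label_distribution(example_ids))
```

When the dataset is loaded with the `datasets` library, the `ClassLabel` feature offers the same mapping through its `int2str` method, so a hard-coded list like this is mainly useful when working from exported files.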
## How to Use
You can load the dataset using the `datasets` library from Hugging Face:
```python
import json
import os

from datasets import load_dataset

dataset = load_dataset("QCRI/ArMeme")

# Directory where the dataset and the exported images will be saved
output_dir = "./ArMeme/"

# Save the dataset in Hugging Face format; this writes all splits to output_dir
dataset.save_to_disk(output_dir)

# To export the raw images and one JSONL metadata file per split:
os.makedirs(output_dir, exist_ok=True)

# Iterate over the dataset and save each image
for split in ['train', 'dev', 'test']:
    jsonl_path = os.path.join(output_dir, f"arabic_memes_categorization_{split}.jsonl")
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        for item in dataset[split]:
            # Access the image directly as it's already a PIL.Image object
            image = item['image']
            image_path = os.path.join(output_dir, item['img_path'])
            # Ensure the directory exists
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            image.save(image_path)
            # Drop the image object so the remaining fields are JSON-serializable
            del item['image']
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
```
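After the export, it can be useful to read the JSONL files back and confirm that every referenced image landed on disk. The following is a minimal sketch using only the standard library. The helper names `read_jsonl` and `missing_images` are hypothetical, and the file name assumes the `arabic_memes_categorization_{split}.jsonl` pattern and `./ArMeme/` directory from the export code above.

```python
import json
import os


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_images(records, root):
    """Return the img_path values whose file is absent under root."""
    return [
        r["img_path"]
        for r in records
        if not os.path.isfile(os.path.join(root, r["img_path"]))
    ]


# Assumed locations, matching the export code above.
output_dir = "./ArMeme/"
jsonl_path = os.path.join(output_dir, "arabic_memes_categorization_dev.jsonl")

# Only run the check if the export has already been done.
if os.path.exists(jsonl_path):
    records = list(read_jsonl(jsonl_path))
    missing = missing_images(records, output_dir)
    print(f"{len(records)} records, {len(missing)} missing images")
```

Each record read back this way holds the `id`, `text`, `img_path`, and `class_label` fields; the `image` field is absent because the export deletes it before writing.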
**Language:** Arabic
**Modality:** Multimodal (text + image)
**Number of Samples:** 5,725 (4,007 train, 584 dev, 1,134 test)
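The sample count can be checked against the per-split `num_examples` values in the card metadata. This short sketch sums them and prints each split's share; the variable names are illustrative only.

```python
# Example counts per split, copied from the dataset card metadata.
SPLIT_SIZES = {"train": 4007, "dev": 584, "test": 1134}

total = sum(SPLIT_SIZES.values())
shares = {name: n / total for name, n in SPLIT_SIZES.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} ({share:.1%})")
```

The splits sum to 5,725 examples, which works out to roughly a 70/10/20 train/dev/test division.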
## License
This dataset is licensed under the **CC-BY-NC-SA-4.0** license.
## Citation
The paper is available on [arXiv](https://arxiv.org/pdf/2406.03916v2). Please cite it using the BibTeX entry below.
```
@inproceedings{alam2024armeme,
title={{ArMeme}: Propagandistic Content in Arabic Memes},
author={Alam, Firoj and Hasnat, Abul and Ahmed, Fatema and Hasan, Md Arid and Hasanain, Maram},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2024},
address={Miami, Florida},
month={November 12--16},
publisher={Association for Computational Linguistics},
}
```
Provider:
QCRI



