ArMeme
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/QCRI/ArMeme
Description:
# ArMeme Dataset
## Overview
ArMeme is the first multimodal Arabic memes dataset that includes both text and images, collected from various social media platforms. It serves as the first resource dedicated to Arabic multimodal research. While the dataset has been annotated to identify propaganda in memes, it is versatile and can be utilized for a wide range of other research purposes, including sentiment analysis, hate speech detection, cultural studies, meme generation, and cross-lingual transfer learning. The dataset opens new avenues for exploring the intersection of language, culture, and visual communication.
## Dataset Structure
The dataset is divided into three splits:
- **Train**: The training set
- **Dev**: The development/validation set
- **Test**: The test set
Each entry in the dataset includes:
- `id`: The unique identifier of the entry.
- `text`: The textual content associated with the image.
- `image`: The corresponding image.
- `img_path`: The file path to the image.
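The fields listed above can be checked programmatically before further processing. The following is a minimal sketch, assuming each record behaves like a plain Python dict; `REQUIRED_FIELDS` and `missing_fields` are hypothetical helper names and are not part of the dataset or the `datasets` library. The released data may carry additional columns beyond the four documented here, which this check ignores.

```python
# Hypothetical helper: the field names are taken from the list above.
REQUIRED_FIELDS = ("id", "text", "image", "img_path")


def missing_fields(record: dict) -> list[str]:
    """Return the documented fields that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Example with a made-up record (not real dataset content).
sample = {"id": "example-1", "text": "...", "img_path": "images/example.jpg"}
print(missing_fields(sample))  # → ['image']
```

An empty list means the record has all four documented fields.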
## How to Use
You can load the dataset using the `datasets` library from Hugging Face:
```python
import json
import os

from datasets import load_dataset

dataset = load_dataset("QCRI/ArMeme")

# Directory where the dataset, images, and JSONL files are saved
output_dir = "./ArMeme/"
os.makedirs(output_dir, exist_ok=True)

# Save all splits to the output directory in the Hugging Face dataset format
dataset.save_to_disk(output_dir)

# To get the raw images, iterate over each split and save every image
for split in ['train', 'dev', 'test']:
    jsonl_path = os.path.join(output_dir, f"arabic_memes_categorization_{split}.jsonl")
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        for item in dataset[split]:
            # Access the image directly; it is already a PIL.Image object
            image = item['image']
            image_path = os.path.join(output_dir, item['img_path'])
            # Ensure the target directory exists
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            image.save(image_path)
            # Drop the image so the remaining fields can be written as JSON
            del item['image']
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
```
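The export loop above writes one JSON object per line for each split. A minimal sketch for reading such a file back is shown below; `read_jsonl` is a hypothetical helper name, and the path assumes the export has already been run with the same output directory and file naming.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Usage, assuming the export above has already produced this file.
path = Path("./ArMeme") / "arabic_memes_categorization_train.jsonl"
if path.exists():
    records = list(read_jsonl(path))
    print(len(records), "records loaded")
```

Each record keeps its `img_path`, so the saved image can be reopened from the output directory when needed.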
**Language:** Arabic
**Modality:** Multimodal (text + image)
**Number of Samples:** ~6000
## License
This dataset is licensed under the **CC-BY-NC-SA-4.0** license.
## Citation
The paper is available on [arXiv](https://arxiv.org/pdf/2406.03916v2). Please cite it using the BibTeX entry below.
```
@inproceedings{alam2024armeme,
title={{ArMeme}: Propagandistic Content in Arabic Memes},
author={Alam, Firoj and Hasnat, Abul and Ahmed, Fatema and Hasan, Md Arid and Hasanain, Maram},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2024},
address={Miami, Florida},
month={November 12--16},
publisher={Association for Computational Linguistics},
}
```
Provider:
maas
Created:
2025-06-17



