QCRI/ArMeme

Name: QCRI/ArMeme
Creator: QCRI
Published: 2024-10-08 18:47:18
License: CC-BY-NC-SA-4.0

Source: Hugging Face. Updated 2024-10-08; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/QCRI/ArMeme

Description:

---
license: cc-by-nc-sa-4.0
task_categories:
- image-classification
- text-classification
- visual-question-answering
language:
- ar
pretty_name: ArMeme
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: img_path
    dtype: string
  - name: class_label
    dtype:
      class_label:
        names:
          '0': not_propaganda
          '1': propaganda
          '2': not-meme
          '3': other
  splits:
  - name: train
    num_bytes: 288878900.171
    num_examples: 4007
  - name: dev
    num_bytes: 45908447.0
    num_examples: 584
  - name: test
    num_bytes: 81787436.176
    num_examples: 1134
  download_size: 423396230
  dataset_size: 416574783.347
---

# ArMeme Dataset

## Overview

ArMeme is the first multimodal Arabic memes dataset that includes both text and images, collected from various social media platforms. It serves as the first resource dedicated to Arabic multimodal research. Although the dataset is annotated for propaganda detection in memes, it can also support other research, including sentiment analysis, hate speech detection, cultural studies, meme generation, and cross-lingual transfer learning. The dataset opens new avenues for exploring the intersection of language, culture, and visual communication.

## Dataset Structure

The dataset is divided into three splits:

- **Train**: The training set (4,007 examples)
- **Dev**: The development/validation set (584 examples)
- **Test**: The test set (1,134 examples)

Each entry in the dataset includes:

- `id`: The unique identifier of the entry.
- `text`: The textual content associated with the image.
- `image`: The corresponding image.
- `img_path`: The file path to the image.
- `class_label`: The annotated class, one of `not_propaganda` (0), `propaganda` (1), `not-meme` (2), or `other` (3).

## How to Use

You can load the dataset using the `datasets` library from Hugging Face:

```python
import json
import os

from datasets import load_dataset

dataset = load_dataset("QCRI/ArMeme")

# Directory where the dataset and the extracted images are saved
output_dir = "./ArMeme/"

# Save the dataset to the output directory. This saves all splits.
dataset.save_to_disk(output_dir)

# To get the raw images out of the HF dataset format, write each image
# to disk and store the remaining fields in one JSONL file per split.
os.makedirs(output_dir, exist_ok=True)

for split in ["train", "dev", "test"]:
    jsonl_path = os.path.join(output_dir, f"arabic_memes_categorization_{split}.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for item in dataset[split]:
            # The image is already decoded as a PIL.Image object
            image = item["image"]
            image_path = os.path.join(output_dir, item["img_path"])
            # Ensure the target directory exists
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            image.save(image_path)
            # Drop the image object so the remaining fields are JSON-serializable
            del item["image"]
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

**Language:** Arabic

**Modality:** Multimodal (text + image)

**Number of Samples:** 5,725 (4,007 train, 584 dev, 1,134 test)

## License

This dataset is licensed under the **CC-BY-NC-SA-4.0** license.

## Citation

The paper is available on [arXiv](https://arxiv.org/pdf/2406.03916v2). Please use the BibTeX entry below to cite it.

```
@inproceedings{alam2024armeme,
  title={{ArMeme}: Propagandistic Content in Arabic Memes},
  author={Alam, Firoj and Hasnat, Abul and Ahmed, Fatema and Hasan, Md Arid and Hasanain, Maram},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2024},
  address={Miami, Florida},
  month={November 12--16},
  publisher={Association for Computational Linguistics},
}
```
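As a follow-up to the export loop in the How to Use section, the label distribution of a split can be inspected directly from the exported JSONL files, without reloading the images. The sketch below is a minimal illustration, assuming each JSONL record keeps the integer `class_label` field and that the id-to-name mapping matches the dataset card metadata. `LABEL_NAMES` and `count_labels` are hypothetical helper names introduced here, not part of the dataset or the `datasets` library.

```python
import json
from collections import Counter

# Label ids and names as listed in the dataset card metadata.
LABEL_NAMES = {0: "not_propaganda", 1: "propaganda", 2: "not-meme", 3: "other"}


def count_labels(jsonl_path):
    """Count records per label name in one exported JSONL file.

    Assumes every non-empty line is a JSON object with an integer
    `class_label` field (an assumption about the export format).
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[LABEL_NAMES[record["class_label"]]] += 1
    return counts
```

For example, `count_labels("./ArMeme/arabic_memes_categorization_train.jsonl")` would return a `Counter` keyed by label name for the training split, provided the export step has been run with the same output directory.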


Provided by:

QCRI
