
ArMeme

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/QCRI/ArMeme
Description:
# ArMeme Dataset

## Overview

ArMeme is the first multimodal Arabic memes dataset that includes both text and images, collected from various social media platforms. It serves as the first resource dedicated to Arabic multimodal research. While the dataset has been annotated to identify propaganda in memes, it is versatile and can be used for a wide range of other research purposes, including sentiment analysis, hate speech detection, cultural studies, meme generation, and cross-lingual transfer learning. The dataset opens new avenues for exploring the intersection of language, culture, and visual communication.

## Dataset Structure

The dataset is divided into three splits:

- **Train**: The training set
- **Dev**: The development/validation set
- **Test**: The test set

Each entry in the dataset includes:

- `id`: The identifier of the entry.
- `text`: The textual content associated with the image.
- `image`: The corresponding image.
- `img_path`: The file path to the image.

## How to Use

You can load the dataset using the `datasets` library from Hugging Face:

```python
import json
import os

from datasets import load_dataset

dataset = load_dataset("QCRI/ArMeme")

# Specify the directory where you want to save the dataset
output_dir = "./ArMeme/"

# Save the dataset to the specified directory.
# This will save all splits to the output directory.
dataset.save_to_disk(output_dir)

# If you want to get the raw images from the HF dataset format,
# write each image to disk and the remaining fields to a JSONL file.
os.makedirs(output_dir, exist_ok=True)

# Iterate over the dataset and save each image
for split in ['train', 'dev', 'test']:
    jsonl_path = os.path.join(output_dir, f"arabic_memes_categorization_{split}.jsonl")
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        for item in dataset[split]:
            # Access the image directly as it's already a PIL.Image object
            image = item['image']
            image_path = os.path.join(output_dir, item['img_path'])
            # Ensure the directory exists
            os.makedirs(os.path.dirname(image_path), exist_ok=True)
            image.save(image_path)
            # The image object is not JSON-serializable, so drop it before writing
            del item['image']
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
```

**Language:** Arabic

**Modality:** Multimodal (text + image)

**Number of Samples:** ~6000

## License

This dataset is licensed under the **CC-BY-NC-SA-4.0** license.

## Citation

Please find the paper on [ArXiv](https://arxiv.org/pdf/2406.03916v2) and use the bib info below to cite the paper.

```
@inproceedings{alam2024armeme,
  title={{ArMeme}: Propagandistic Content in Arabic Memes},
  author={Alam, Firoj and Hasnat, Abul and Ahmed, Fatema and Hasan, Md Arid and Hasanain, Maram},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2024},
  address={Miami, Florida},
  month={November 12--16},
  publisher={Association for Computational Linguistics},
}
```
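## Reading the Exported Files

The export loop in How to Use writes one JSONL file per split next to the saved images. As a complement, here is a minimal sketch of reading one of those files back without the `datasets` library. It assumes the file naming used in that loop (`arabic_memes_categorization_{split}.jsonl`) and that each record's `img_path` is relative to the output directory. The function name `read_exported_split` and the added `abs_img_path` key are hypothetical conveniences, not part of the dataset.

```python
import json
import os


def read_exported_split(output_dir, split):
    """Read one exported split back into a list of dicts.

    Assumes the layout produced by the export loop:
    <output_dir>/arabic_memes_categorization_<split>.jsonl
    with each record's `img_path` relative to output_dir.
    """
    jsonl_path = os.path.join(
        output_dir, f"arabic_memes_categorization_{split}.jsonl"
    )
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines rather than failing on them
                continue
            record = json.loads(line)
            # Hypothetical helper key: absolute location of the saved image
            record["abs_img_path"] = os.path.abspath(
                os.path.join(output_dir, record["img_path"])
            )
            records.append(record)
    return records
```

For example, `read_exported_split("./ArMeme/", "dev")` would return the development records, each carrying the fields written by the export loop plus `abs_img_path`, which can then be passed to `PIL.Image.open` if the image itself is needed.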

Provider: maas

Created: 2025-06-17