MohamedRashad/Arabic-VLM-Full-Pearl

Name: MohamedRashad/Arabic-VLM-Full-Pearl
Creator: MohamedRashad
Published: 2025-12-08 22:17:00
License: No description available

Source: Hugging Face. Updated: 2025-12-08. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/MohamedRashad/Arabic-VLM-Full-Pearl

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: type
    dtype: string
  - name: augmented_caption
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: country
    dtype: string
  - name: category
    dtype: string
  - name: image
    dtype: image
  splits:
  - name: train
    num_bytes: 146251873789
    num_examples: 309298
  download_size: 158827499903
  dataset_size: 146251873789
task_categories:
- question-answering
- text-generation
- image-to-text
language:
- ar
pretty_name: The Arabic VLM Dataset (Full Pearl Edition)
size_categories:
- 100K<n<1M
---

# 💎 The Arabic VLM Dataset (Full Pearl Edition)

This repository contains the full, unreviewed dataset comprising 309K multimodal examples. This data was generated automatically using the agentic pipeline developed for the **Pearl** project, as described in our paper.

**Disclaimer:** This is the raw, synthetic data that has **not** been subject to human review. It was generated as part of the data creation process and is released for research purposes. It may contain noise, errors, or inconsistencies. For the high-quality, human-reviewed benchmarks, please see the links below.
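
Because the metadata above lists a download size of roughly 159 GB, it may be more practical to stream the train split than to download it in full. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that its `load_dataset(..., streaming=True)` call is used against this repository ID. The helper names (`download_size_gib`, `text_only`, `take_by_country`, `stream_train_split`, `preview`) are hypothetical and chosen only for illustration, and `"Egypt"` is an assumed example value for the `country` field, not one confirmed by the card.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator

# Values copied from the dataset card's YAML metadata.
NUM_EXAMPLES = 309_298
DOWNLOAD_SIZE_BYTES = 158_827_499_903
TEXT_FIELDS = ("type", "augmented_caption", "question", "answer", "country", "category")


def download_size_gib() -> float:
    """Return the full download size in GiB (2**30 bytes)."""
    return DOWNLOAD_SIZE_BYTES / 2**30


def text_only(row: Dict) -> Dict:
    """Keep only the string fields of a row, dropping the image."""
    return {name: row.get(name) for name in TEXT_FIELDS}


def take_by_country(rows: Iterable[Dict], country: str, limit: int) -> Iterator[Dict]:
    """Lazily yield the text fields of at most `limit` rows whose `country` matches."""
    matches = (text_only(r) for r in rows if r.get("country") == country)
    return islice(matches, limit)


def stream_train_split():
    """Open the train split lazily. Needs `pip install datasets` and network access."""
    # Imported here so the helpers above remain usable without the library.
    from datasets import load_dataset

    return load_dataset(
        "MohamedRashad/Arabic-VLM-Full-Pearl", split="train", streaming=True
    )


def preview(country: str = "Egypt", limit: int = 3) -> None:
    """Print a few question/answer pairs. The default country is a hypothetical value."""
    print(f"Full download: about {download_size_gib():.0f} GiB, {NUM_EXAMPLES} examples")
    for row in take_by_country(stream_train_split(), country, limit):
        print(row["question"], "->", row["answer"])
```

Calling `preview()` would start streaming from the Hub. Note that filtering on `country` in this way still fetches every row it scans, images included, so a rarely occurring value may take a long time to match. Since the data is unreviewed, inspecting a handful of rows like this before training on them is a reasonable precaution.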
## This is a reuploaded version from Google Drive just for Hugging Face

You can also download the dataset from Google Drive (not recommended, since extraction is cumbersome):

* **Google Drive:** [https://drive.google.com/drive/folders/1awP5ONLRz2IYRYzWSymoR16QBkP3l3HK?usp=sharing](https://drive.google.com/drive/folders/1awP5ONLRz2IYRYzWSymoR16QBkP3l3HK?usp=sharing)

## Resources

* **Official Website:** [https://pearl.dlnlp.ai/](https://pearl.dlnlp.ai/)
* **Human-Reviewed Benchmarks (Hugging Face):** [https://huggingface.co/collections/UBC-NLP/pearl](https://huggingface.co/collections/UBC-NLP/pearl)
* **Paper (arXiv):** [Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset](https://arxiv.org/abs/2505.21979)

## Citation (Original Authors)

If you use this dataset or the accompanying benchmarks, please cite our paper:

```bibtex
@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin and Magdy, Samar M. and El Mekki, Abdellah and Nacar, Omer and Nafea, Youssef and Abdelfadil, Safaa Taher and Yahya, Abdulfattah Mohammed and Luqman, Hamzah and Almarwani, Nada and Aloufi, Samah and Qawasmeh, Baraah and Atou, Houdaifa and Sibaee, Serry and Alsayadi, Hamzah A. and Al-Dhabyani, Walid and Al-shaibani, Maged S. and El aatar, Aya and Qandos, Nour and Alhamouri, Rahaf and Ahmad, Samar and AL-Ghrawi, Mohammed Anwar and Yacoub, Aminetou and AbuHweidi, Ruwa and Lemin, Vatimetou Mohamed and Abdel-Salam, Reem and Bashiti, Ahlam and Ammar, Adel and Alansari, Aisha and Ashraf, Ahmed and Alturayeif, Nora and Alcoba Inciarte, Alcides and Elmadany, AbdelRahim A. and Tourad, Mohamedou Cheikh and Berrada, Ismail and Jarrar, Mustafa and Shehata, Shady and Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1254/",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7"
}
```


Provider:

MohamedRashad
