MohamedRashad/Arabic-VLM-Full-Pearl
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/MohamedRashad/Arabic-VLM-Full-Pearl
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: type
    dtype: string
  - name: augmented_caption
    dtype: string
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: country
    dtype: string
  - name: category
    dtype: string
  - name: image
    dtype: image
  splits:
  - name: train
    num_bytes: 146251873789
    num_examples: 309298
  download_size: 158827499903
  dataset_size: 146251873789
task_categories:
- question-answering
- text-generation
- image-to-text
language:
- ar
pretty_name: The Arabic VLM Dataset (Full Pearl Edition)
size_categories:
- 100K<n<1M
---
# 💎 The Arabic VLM Dataset (Full Pearl Edition)
This repository contains the full, unreviewed dataset comprising 309K multimodal examples. This data was generated automatically using the agentic pipeline developed for the **Pearl** project, as described in our paper.
**Disclaimer:** This is the raw, synthetic data that has **not** been subject to human review. It was generated as part of the data creation process and is released for research purposes. It may contain noise, errors, or inconsistencies. For the high-quality, human-reviewed benchmarks, please see the links below.
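Before downloading, it can help to estimate the storage footprint from the card metadata. The sketch below uses only the `num_bytes`, `num_examples`, and `download_size` values listed in the card's front matter; the helper name `summarize_sizes` is hypothetical and exists only for illustration.

```python
# Values copied from the dataset card metadata (dataset_info section).
NUM_BYTES = 146_251_873_789      # num_bytes / dataset_size of the train split
NUM_EXAMPLES = 309_298           # num_examples of the train split
DOWNLOAD_SIZE = 158_827_499_903  # download_size

GIB = 1024 ** 3  # bytes per gibibyte


def summarize_sizes(num_bytes: int, num_examples: int, download_size: int) -> dict:
    """Return rough storage figures derived from the card metadata."""
    return {
        "dataset_gib": num_bytes / GIB,
        "download_gib": download_size / GIB,
        "avg_kb_per_example": num_bytes / num_examples / 1000,
    }


summary = summarize_sizes(NUM_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

By these figures the train split is roughly 136 GiB, the download is roughly 148 GiB, and an example averages about 473 KB, most of which is presumably the image.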
## Re-upload from Google Drive for Hugging Face
You can also download the dataset from Google Drive (not recommended, as extraction is cumbersome):
* **Google Drive:** [https://drive.google.com/drive/folders/1awP5ONLRz2IYRYzWSymoR16QBkP3l3HK?usp=sharing](https://drive.google.com/drive/folders/1awP5ONLRz2IYRYzWSymoR16QBkP3l3HK?usp=sharing)
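Given the size of the dataset, streaming from Hugging Face may be more practical than a full download. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID from the page title is reachable. `stream_rows` and `to_qa_pair` are hypothetical helper names; the field names come from the `features` list in the card metadata.

```python
from itertools import islice
from typing import Iterator

REPO_ID = "MohamedRashad/Arabic-VLM-Full-Pearl"  # repository ID from the page title


def stream_rows(limit: int = 5) -> Iterator[dict]:
    """Yield the first `limit` rows without downloading the full dataset."""
    # Imported lazily so the pure helper below works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, split="train", streaming=True)
    yield from islice(ds, limit)


def to_qa_pair(row: dict) -> dict:
    """Keep only the text fields needed for question answering."""
    return {
        "question": row["question"],
        "answer": row["answer"],
        "country": row["country"],
        "category": row["category"],
    }
```

For example, `for row in stream_rows(3): print(to_qa_pair(row))` would print three question-answer records. The `image` field is left out here for brevity, and because the data is unreviewed, individual rows may be noisy.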
## Resources
* **Official Website:** [https://pearl.dlnlp.ai/](https://pearl.dlnlp.ai/)
* **Human-Reviewed Benchmarks (Hugging Face):** [https://huggingface.co/collections/UBC-NLP/pearl](https://huggingface.co/collections/UBC-NLP/pearl)
* **Paper (arXiv):** [Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset](https://arxiv.org/abs/2505.21979)
## Citation (Original Authors)
If you use this dataset or the accompanying benchmarks, please cite our paper:
```bibtex
@inproceedings{alwajih-etal-2025-pearl,
title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
author = "Alwajih, Fakhraddin and
Magdy, Samar M. and
El Mekki, Abdellah and
Nacar, Omer and
Nafea, Youssef and
Abdelfadil, Safaa Taher and
Yahya, Abdulfattah Mohammed and
Luqman, Hamzah and
Almarwani, Nada and
Aloufi, Samah and
Qawasmeh, Baraah and
Atou, Houdaifa and
Sibaee, Serry and
Alsayadi, Hamzah A. and
Al-Dhabyani, Walid and
Al-shaibani, Maged S. and
El aatar, Aya and
Qandos, Nour and
Alhamouri, Rahaf and
Ahmad, Samar and
AL-Ghrawi, Mohammed Anwar and
Yacoub, Aminetou and
AbuHweidi, Ruwa and
Lemin, Vatimetou Mohamed and
Abdel-Salam, Reem and
Bashiti, Ahlam and
Ammar, Adel and
Alansari, Aisha and
Ashraf, Ahmed and
Alturayeif, Nora and
Alcoba Inciarte, Alcides and
Elmadany, AbdelRahim A. and
Tourad, Mohamedou Cheikh and
Berrada, Ismail and
Jarrar, Mustafa and
Shehata, Shady and
Abdul-Mageed, Muhammad",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1254/",
pages = "23048--23079",
ISBN = "979-8-89176-335-7"
}
```
Provider: MohamedRashad



