Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Pearl-vdr-ar-train-preprocessed
tags:
- sentence-transformers
- visual-document-retrieval
- multimodal
- arabic
- embedding
task_categories:
- image-to-text
- text-to-image
- sentence-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: train
  data_files:
  - split: train
    path: train/train-*
- config_name: dev
  data_files:
  - split: train
    path: dev/train-*
- config_name: test
  data_files:
  - split: train
    path: test/train-*
---
# Pearl-vdr-ar-train-preprocessed
Culturally aligned Arabic visual document retrieval (VDR) data: (query, image, hard negatives) triplets for training multimodal embedding models with Sentence Transformers.
## Dataset structure
Each row contains:
| Column | Type | Description |
|---------------|--------|----------------------------------------------------------------------|
| `query` | string | Arabic text question about the image |
| `category` | string | High-level Arab-culture topic (Music, Landmarks, Cuisine, ...) |
| `country` | string | Country the sample is anchored to (Algeria, Saudi Arabia, ...) |
| `image` | image | The positive document image (what the query should retrieve) |
| `negative_0` | image | Hard negative — same category, different image |
| `negative_1` | image | Hard negative — same country, different category |
| `negative_2` | image | Hard negative — same category, different image |
| `negative_3` | image | Random corpus negative |
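
To make the column layout concrete, the sketch below regroups one row into an anchor, a positive, and a list of negatives. The helper name `split_row` and the placeholder strings standing in for images are illustrative assumptions; real rows hold decoded image objects, and a trainer may expect a different column arrangement.

```python
NEGATIVE_COLUMNS = ["negative_0", "negative_1", "negative_2", "negative_3"]

def split_row(row):
    """Regroup one dataset row into anchor / positive / negatives."""
    return {
        "anchor": row["query"],
        "positive": row["image"],
        "negatives": [row[name] for name in NEGATIVE_COLUMNS],
        # Metadata is carried along but is not part of the training tuple.
        "meta": {"category": row["category"], "country": row["country"]},
    }

# Hypothetical row: strings stand in for the text and image values.
example_row = {
    "query": "<Arabic question about the image>",
    "category": "Landmarks",
    "country": "Algeria",
    "image": "<positive image>",
    "negative_0": "<neg 0>",
    "negative_1": "<neg 1>",
    "negative_2": "<neg 2>",
    "negative_3": "<neg 3>",
}

parts = split_row(example_row)
print(parts["anchor"], len(parts["negatives"]))
```
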
## Splits
| Config | Rows | Purpose |
|---------|--------|------------------------------------|
| `train` | ~48 k | Finetuning |
| `dev` | 1 000 | `InformationRetrievalEvaluator` during training |
| `test` | 1 000 | Held-out final evaluation |
Splits are stratified by `category` so each split covers the full Arab-culture topic range.
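
A stratified split of this kind can be sketched in plain Python. The card states only the seed (42) and the stratification key (`category`); the proportional allocation, the function name `stratified_split`, and the toy rows below are assumptions made for illustration, not the script used to build the dataset.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, n_dev, n_test, seed=42):
    """Split rows into train/dev/test with each stratum sampled proportionally."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train, dev, test = [], [], []
    for name in sorted(groups):  # sorted so the order is reproducible
        group = groups[name]
        rng.shuffle(group)
        # Proportional allocation; rounding makes the split sizes approximate.
        k_dev = round(n_dev * len(group) / len(rows))
        k_test = round(n_test * len(group) / len(rows))
        dev.extend(group[:k_dev])
        test.extend(group[k_dev:k_dev + k_test])
        train.extend(group[k_dev + k_test:])
    return train, dev, test

# Toy corpus: three categories named in the card, 100 placeholder rows each.
rows = [{"id": f"{c}-{i}", "category": c}
        for c in ("Music", "Landmarks", "Cuisine") for i in range(100)]
train, dev, test = stratified_split(rows, "category", n_dev=30, n_test=30)
print(len(train), len(dev), len(test))  # 240 30 30
```

Because every category contributes to each split in proportion to its size, no topic is missing from `dev` or `test`.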
## How to load
```python
from datasets import load_dataset
train = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "train", split="train")
dev = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "dev", split="train")
test = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "test", split="train")
```
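
Once loaded, the `dev` rows need to be reshaped before they can feed `InformationRetrievalEvaluator`, which takes `queries`, `corpus`, and `relevant_docs` mappings. The sketch below builds those three mappings from row dictionaries. The helper name `build_ir_mappings`, the ID scheme, and the placeholder strings are assumptions; whether images can be placed in `corpus` directly depends on the Sentence Transformers version in use and is not confirmed by this card.

```python
def build_ir_mappings(rows):
    """Build queries / corpus / relevant_docs dicts from dev-style rows."""
    queries, corpus, relevant_docs = {}, {}, {}
    for i, row in enumerate(rows):
        qid, pos_id = f"q{i}", f"d{i}"
        queries[qid] = row["query"]
        corpus[pos_id] = row["image"]
        relevant_docs[qid] = {pos_id}
        # Negatives join the corpus as distractors with no relevance label.
        for j in range(4):
            corpus[f"d{i}_neg{j}"] = row[f"negative_{j}"]
    return queries, corpus, relevant_docs

# Hypothetical rows: strings stand in for the text and image values.
rows = [
    {"query": f"<question {i}>", "image": f"<image {i}>",
     **{f"negative_{j}": f"<neg {i}.{j}>" for j in range(4)}}
    for i in range(2)
]
queries, corpus, relevant_docs = build_ir_mappings(rows)
print(len(queries), len(corpus))  # 2 10
```

Since several rows may share one image, a real setup would probably key the corpus by image identity rather than by row index; this sketch ignores that.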
## Preprocessing rules
- Image deduplication via `augmented_caption` (rows sharing a caption share an image).
- Hard-negative mining is **metadata-based** (category + country). This is fast, but embedding-based mining (e.g. with CLIP or Qwen embeddings) could improve it.
- Images re-encoded to JPEG q=85, longest side capped at 1280 px.
- Random seed: 42; stratified split by `category`.
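
The metadata-based mining rule can be sketched as follows, using the negative slots described in the dataset structure table. This is a minimal illustration, not the actual preprocessing script: the `image_id` field, the function name `mine_negatives`, and the fallback to a random pick when a metadata pool is empty are all assumptions.

```python
import random

def mine_negatives(rows, index, rng):
    """Pick four negatives for rows[index] using only category/country metadata."""
    target = rows[index]

    def candidates(pred):
        # Never propose the positive image itself.
        return [r["image_id"] for r in rows
                if r["image_id"] != target["image_id"] and pred(r)]

    same_cat = candidates(lambda r: r["category"] == target["category"])
    same_country = candidates(lambda r: r["country"] == target["country"]
                              and r["category"] != target["category"])
    anything = candidates(lambda r: True)  # assumed non-empty (corpus of 2+ images)

    def pick(pool):
        # Assumed fallback: use the whole corpus when a metadata pool is empty.
        return rng.choice(pool or anything)

    neg0 = pick(same_cat)
    # Best effort to make the second same-category negative differ from the first.
    neg2 = pick([c for c in same_cat if c != neg0])
    return {
        "negative_0": neg0,
        "negative_1": pick(same_country),
        "negative_2": neg2,
        "negative_3": pick(anything),
    }

# Toy corpus with a hypothetical image_id field.
rows = [
    {"image_id": "a", "category": "Music", "country": "Algeria"},
    {"image_id": "b", "category": "Music", "country": "Saudi Arabia"},
    {"image_id": "c", "category": "Music", "country": "Saudi Arabia"},
    {"image_id": "d", "category": "Cuisine", "country": "Algeria"},
    {"image_id": "e", "category": "Landmarks", "country": "Saudi Arabia"},
]
negs = mine_negatives(rows, 0, random.Random(42))
print(negs)
```

Each lookup here scans the whole list, which is fine for a sketch; at the scale of the `train` config one would likely index rows by category and country first.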
## Citation
If you use this dataset or the accompanying benchmarks, please cite our paper:
```bibtex
@inproceedings{alwajih-etal-2025-pearl,
title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
author = "Alwajih, Fakhraddin and
Magdy, Samar M. and
El Mekki, Abdellah and
Nacar, Omer and
Nafea, Youssef and
Abdelfadil, Safaa Taher and
Yahya, Abdulfattah Mohammed and
Luqman, Hamzah and
Almarwani, Nada and
Aloufi, Samah and
Qawasmeh, Baraah and
Atou, Houdaifa and
Sibaee, Serry and
Alsayadi, Hamzah A. and
Al-Dhabyani, Walid and
Al-shaibani, Maged S. and
El aatar, Aya and
Qandos, Nour and
Alhamouri, Rahaf and
Ahmad, Samar and
AL-Ghrawi, Mohammed Anwar and
Yacoub, Aminetou and
AbuHweidi, Ruwa and
Lemin, Vatimetou Mohamed and
Abdel-Salam, Reem and
Bashiti, Ahlam and
Ammar, Adel and
Alansari, Aisha and
Ashraf, Ahmed and
Alturayeif, Nora and
Alcoba Inciarte, Alcides and
Elmadany, AbdelRahim A. and
Tourad, Mohamedou Cheikh and
Berrada, Ismail and
Jarrar, Mustafa and
Shehata, Shady and
Abdul-Mageed, Muhammad",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1254/",
pages = "23048--23079",
ISBN = "979-8-89176-335-7"
}
```
Provider:
Omartificial-Intelligence-Space



