Name: Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed
Creator: Omartificial-Intelligence-Space
Published: 2026-04-21 08:04:28
License: apache-2.0 (as declared in the dataset card metadata)

Download link:

https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed

Resource description:

---
language:
- ar
license: apache-2.0
pretty_name: Pearl-vdr-ar-train-preprocessed
tags:
- sentence-transformers
- visual-document-retrieval
- multimodal
- arabic
- embedding
task_categories:
- image-to-text
- text-to-image
- sentence-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: train
  data_files:
  - split: train
    path: train/train-*
- config_name: dev
  data_files:
  - split: train
    path: dev/train-*
- config_name: test
  data_files:
  - split: train
    path: test/train-*
---

# Pearl-vdr-ar-train-preprocessed

Arabic culturally-aligned, VDR-style (query, image, hard-negatives) triplets for training multimodal embedding models with Sentence Transformers.

## Dataset structure

Each row contains:

| Column       | Type   | Description                                                    |
|--------------|--------|----------------------------------------------------------------|
| `query`      | string | Arabic text question about the image                           |
| `category`   | string | High-level Arab-culture topic (Music, Landmarks, Cuisine, ...) |
| `country`    | string | Country the sample is anchored to (Algeria, Saudi Arabia, ...) |
| `image`      | image  | The positive document image (what the query should retrieve)   |
| `negative_0` | image  | Hard negative: same category, different image                  |
| `negative_1` | image  | Hard negative: same country, different category                |
| `negative_2` | image  | Hard negative: same category, different image                  |
| `negative_3` | image  | Random corpus negative                                         |

## Splits

| Config  | Rows  | Purpose                                         |
|---------|-------|-------------------------------------------------|
| `train` | ~48k  | Finetuning                                      |
| `dev`   | 1,000 | `InformationRetrievalEvaluator` during training |
| `test`  | 1,000 | Held-out final evaluation                       |

Splits are stratified by `category` so each split covers the full Arab-culture topic range.

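To illustrate what a split stratified by `category` means in practice, here is a minimal sketch using only the Python standard library. It is not the authors' preprocessing code: the function name `stratified_split`, the proportional rounding rule, and the toy category counts are assumptions made for illustration. Only the seed value (42) and the stratification key come from the dataset card.

```python
import random
from collections import defaultdict


def stratified_split(categories, n_dev, n_test, seed=42):
    """Split row indices into train/dev/test so each category is represented.

    `categories` is a list with one category label per row.
    Hypothetical helper for illustration, not the dataset's actual pipeline.
    """
    rng = random.Random(seed)

    # Group row indices by their category label.
    by_category = defaultdict(list)
    for idx, cat in enumerate(categories):
        by_category[cat].append(idx)

    total = len(categories)
    splits = {"train": [], "dev": [], "test": []}
    for cat in sorted(by_category):
        idxs = by_category[cat]
        rng.shuffle(idxs)
        # Assumption: each category contributes its proportional share,
        # with at least one row per held-out split.
        k_dev = max(1, round(n_dev * len(idxs) / total))
        k_test = max(1, round(n_test * len(idxs) / total))
        splits["dev"].extend(idxs[:k_dev])
        splits["test"].extend(idxs[k_dev:k_dev + k_test])
        splits["train"].extend(idxs[k_dev + k_test:])
    return splits


# Toy labels (made up): 100 rows over three of the categories named above.
toy_categories = ["Music"] * 50 + ["Landmarks"] * 30 + ["Cuisine"] * 20
toy_splits = stratified_split(toy_categories, n_dev=10, n_test=10)
print({name: len(idxs) for name, idxs in toy_splits.items()})
```

Because of the per-category rounding, the held-out sizes only approximate `n_dev` and `n_test` in general, and a very small category can end up with no training rows. A real pipeline would need to handle both cases explicitly.
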
## How to load

```python
from datasets import load_dataset

# Each config ("train", "dev", "test") exposes its rows under a single split named "train".
train = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "train", split="train")
dev = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "dev", split="train")
test = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "test", split="train")
```

## Preprocessing rules

- Image deduplication via `augmented_caption` (rows sharing a caption share an image).
- Hard-negative mining is **metadata-based** (category + country): fast, but could be improved by CLIP/Qwen embedding-based mining.
- Images re-encoded to JPEG q=85, longest side capped at 1280 px.
- Random seed: 42; stratified split by `category`.

## Citation

If you use this dataset or the accompanying benchmarks, please cite our paper:

```bibtex
@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin and
      Magdy, Samar M. and
      El Mekki, Abdellah and
      Nacar, Omer and
      Nafea, Youssef and
      Abdelfadil, Safaa Taher and
      Yahya, Abdulfattah Mohammed and
      Luqman, Hamzah and
      Almarwani, Nada and
      Aloufi, Samah and
      Qawasmeh, Baraah and
      Atou, Houdaifa and
      Sibaee, Serry and
      Alsayadi, Hamzah A. and
      Al-Dhabyani, Walid and
      Al-shaibani, Maged S. and
      El aatar, Aya and
      Qandos, Nour and
      Alhamouri, Rahaf and
      Ahmad, Samar and
      AL-Ghrawi, Mohammed Anwar and
      Yacoub, Aminetou and
      AbuHweidi, Ruwa and
      Lemin, Vatimetou Mohamed and
      Abdel-Salam, Reem and
      Bashiti, Ahlam and
      Ammar, Adel and
      Alansari, Aisha and
      Ashraf, Ahmed and
      Alturayeif, Nora and
      Alcoba Inciarte, Alcides and
      Elmadany, AbdelRahim A. and
      Tourad, Mohamedou Cheikh and
      Berrada, Ismail and
      Jarrar, Mustafa and
      Shehata, Shady and
      Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1254/",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7"
}
```
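
Returning to the metadata-based hard-negative mining listed under the preprocessing rules, the sketch below shows one way such a selection could work using only `category` and `country`. It is an illustration under assumptions, not the authors' code: the function `mine_negatives`, the `image_id` field, the fallback to a random unused image, and the toy rows are all hypothetical. The order of the four picks mirrors the `negative_0` to `negative_3` column descriptions.

```python
import random


def mine_negatives(rows, anchor_idx, rng):
    """Return four row indices to use as negatives for rows[anchor_idx].

    Hypothetical helper: each row is a dict with "image_id", "category"
    and "country" ("image_id" is an assumed field, not a dataset column).
    """
    anchor = rows[anchor_idx]
    used_images = {anchor["image_id"]}  # never reuse the positive image

    def pick(predicate):
        pool = [i for i, r in enumerate(rows)
                if r["image_id"] not in used_images and predicate(r)]
        if not pool:
            # Assumed fallback: any image not used yet for this anchor.
            pool = [i for i, r in enumerate(rows)
                    if r["image_id"] not in used_images]
        if not pool:
            return None
        choice = rng.choice(pool)
        used_images.add(rows[choice]["image_id"])
        return choice

    def same_category(r):
        return r["category"] == anchor["category"]

    def same_country_other_category(r):
        return (r["country"] == anchor["country"]
                and r["category"] != anchor["category"])

    def any_row(r):
        return True

    return [
        pick(same_category),                # negative_0
        pick(same_country_other_category),  # negative_1
        pick(same_category),                # negative_2
        pick(any_row),                      # negative_3 (random corpus negative)
    ]


# Toy corpus (made up) reusing category and country names from the card.
rows = [
    {"image_id": "a", "category": "Music", "country": "Algeria"},
    {"image_id": "b", "category": "Music", "country": "Saudi Arabia"},
    {"image_id": "c", "category": "Music", "country": "Saudi Arabia"},
    {"image_id": "d", "category": "Cuisine", "country": "Algeria"},
    {"image_id": "e", "category": "Landmarks", "country": "Saudi Arabia"},
    {"image_id": "f", "category": "Landmarks", "country": "Algeria"},
]
negs = mine_negatives(rows, anchor_idx=0, rng=random.Random(42))
print(negs)
```

Selection of this kind needs no model inference, which is why it is fast. The trade-off noted in the preprocessing rules is that metadata alone cannot tell which same-category images are visually or semantically closest to the positive.
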

Application scenarios: