
Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Pearl-vdr-ar-train-preprocessed
tags:
- sentence-transformers
- visual-document-retrieval
- multimodal
- arabic
- embedding
task_categories:
- image-to-text
- text-to-image
- sentence-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: train
  data_files:
  - split: train
    path: train/train-*
- config_name: dev
  data_files:
  - split: train
    path: dev/train-*
- config_name: test
  data_files:
  - split: train
    path: test/train-*
---

# Pearl-vdr-ar-train-preprocessed

Arabic culturally-aligned, VDR-style (query, image, hard-negatives) triplets for training multimodal embedding models with Sentence Transformers.

## Dataset structure

Each row contains:

| Column       | Type   | Description                                                    |
|--------------|--------|----------------------------------------------------------------|
| `query`      | string | Arabic text question about the image                           |
| `category`   | string | High-level Arab-culture topic (Music, Landmarks, Cuisine, ...) |
| `country`    | string | Country the sample is anchored to (Algeria, Saudi Arabia, ...) |
| `image`      | image  | The positive document image (what the query should retrieve)   |
| `negative_0` | image  | Hard negative: same category, different image                  |
| `negative_1` | image  | Hard negative: same country, different category                |
| `negative_2` | image  | Hard negative: same category, different image                  |
| `negative_3` | image  | Random corpus negative                                         |

## Splits

| Config  | Rows  | Purpose                                         |
|---------|-------|-------------------------------------------------|
| `train` | ~48 k | Finetuning                                      |
| `dev`   | 1 000 | `InformationRetrievalEvaluator` during training |
| `test`  | 1 000 | Held-out final evaluation                       |

Splits are stratified by `category` so each split covers the full Arab-culture topic range.

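A stratified split by `category` can be sketched with the standard library alone. This is a minimal illustration, not the authors' actual script: the function name `stratified_split`, the proportional-allocation rule, and the toy rows are assumptions. Only the seed (42), the stratification key, and the dev/test target sizes come from the card.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="category", dev_size=1000, test_size=1000, seed=42):
    """Split rows into (train, dev, test), keeping the category mix similar in each.

    Hypothetical helper: allocation is proportional to each category's share,
    so rounding may make the dev/test totals differ slightly from the targets.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle

    # Group rows by their stratification key.
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[key]].append(row)

    total = len(rows)
    train, dev, test = [], [], []
    for category in sorted(by_category):  # sorted for a deterministic order
        group = by_category[category]
        rng.shuffle(group)
        n_dev = round(len(group) * dev_size / total)
        n_test = round(len(group) * test_size / total)
        dev.extend(group[:n_dev])
        test.extend(group[n_dev:n_dev + n_test])
        train.extend(group[n_dev + n_test:])
    return train, dev, test


# Toy data: two categories with 50 rows each (not real dataset rows).
rows = [{"query": f"q{i}", "category": c} for c in ("Music", "Cuisine") for i in range(50)]
train, dev, test = stratified_split(rows, dev_size=10, test_size=10)
print(len(train), len(dev), len(test))  # 80 10 10
```

Because every category contributes in proportion to its size, no topic is missing from `dev` or `test` unless it is too small to round up to a single row.
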
## How to load

```python
from datasets import load_dataset

REPO_ID = "Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed"

# Each config ("train", "dev", "test") exposes a single split named "train".
train = load_dataset(REPO_ID, "train", split="train")
dev = load_dataset(REPO_ID, "dev", split="train")
test = load_dataset(REPO_ID, "test", split="train")
```

## Preprocessing rules

- Image deduplication via `augmented_caption` (rows sharing a caption share an image).
- Hard-negative mining is **metadata-based** (category + country); this is fast but could be improved by CLIP/Qwen embedding-based mining.
- Images re-encoded to JPEG q=85, longest side capped at 1280 px.
- Random seed: 42; stratified split by `category`.

## Citation

If you use this dataset or the accompanying benchmarks, please cite our paper:

```bibtex
@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin and Magdy, Samar M. and El Mekki, Abdellah and
      Nacar, Omer and Nafea, Youssef and Abdelfadil, Safaa Taher and
      Yahya, Abdulfattah Mohammed and Luqman, Hamzah and Almarwani, Nada and
      Aloufi, Samah and Qawasmeh, Baraah and Atou, Houdaifa and Sibaee, Serry and
      Alsayadi, Hamzah A. and Al-Dhabyani, Walid and Al-shaibani, Maged S. and
      El aatar, Aya and Qandos, Nour and Alhamouri, Rahaf and Ahmad, Samar and
      AL-Ghrawi, Mohammed Anwar and Yacoub, Aminetou and AbuHweidi, Ruwa and
      Lemin, Vatimetou Mohamed and Abdel-Salam, Reem and Bashiti, Ahlam and
      Ammar, Adel and Alansari, Aisha and Ashraf, Ahmed and Alturayeif, Nora and
      Alcoba Inciarte, Alcides and Elmadany, AbdelRahim A. and
      Tourad, Mohamedou Cheikh and Berrada, Ismail and Jarrar, Mustafa and
      Shehata, Shady and Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and
      Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1254/",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7"
}
```
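
The re-encoding rule listed under the preprocessing rules (longest side capped at 1280 px) comes down to a small size computation. The sketch below is an assumption about how such a cap might be applied, not the authors' code: the helper name `capped_size` is hypothetical, and it assumes smaller images are left at their original size rather than upscaled.

```python
def capped_size(width, height, max_side=1280):
    """Return (width, height) scaled so the longest side is at most max_side.

    Hypothetical helper: preserves aspect ratio and never upscales.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the cap, leave untouched
    # Scale both sides by max_side / longest, rounding to whole pixels.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


print(capped_size(4000, 3000))  # (1280, 960)
print(capped_size(800, 600))    # (800, 600)
```

With Pillow, the resulting size could be passed to `Image.resize` and the image written with `img.save(path, format="JPEG", quality=85)` to match the q=85 setting.
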
Provider:
Omartificial-Intelligence-Space