Omartificial-Intelligence-Space/Pearl-vdr-ar-train-hard-mined

Name: Omartificial-Intelligence-Space/Pearl-vdr-ar-train-hard-mined
Creator: Omartificial-Intelligence-Space
Published: 2026-04-22 10:21:13
License: No description available

Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Omartificial-Intelligence-Space/Pearl-vdr-ar-train-hard-mined

Resource overview:

Pearl-vdr-ar-train-hard-mined is an Arabic, culturally aligned Visual Document Retrieval (VDR) dataset with model-mined hard negatives. It is derived from a preprocessed parent dataset by replacing the metadata-based negatives with the top-4 most similar non-matching images, ranked by cosine similarity. The dataset pairs Arabic text queries on cultural topics, spanning 9 topics and 19 Arab states, with their corresponding images and hard negatives. It is split into train, dev, and test sets; the train split contains 48,002 rows. Construction involves embedding the images and queries, computing cosine similarities, and mining hard negatives. The primary use cases are research on hard-negative mining for multimodal retrieval in low-resource languages, benchmarking miner quality, and multi-loss training. The dataset credits several sources and provides a citation for the Pearl Dataset paper.
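
The mining step described above (embed, score by cosine similarity, keep the top-4 non-matching images) can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function names, the toy 2-D vectors, and the choice to score images against the query embedding are all assumptions made for illustration, and a real run would use embeddings from a multimodal model and a vectorized library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_hard_negatives(query_vec, image_vecs, positive_idx, k=4):
    """Return indices of the k non-matching images most similar to the query.

    Hypothetical helper: assumes similarity is measured between the query
    embedding and each image embedding.
    """
    scored = [
        (cosine(query_vec, vec), idx)
        for idx, vec in enumerate(image_vecs)
        if idx != positive_idx  # never pick the matching image as a negative
    ]
    # Highest similarity first: these are the "hardest" negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 2-D embeddings standing in for real model outputs (illustrative only).
image_vecs = [
    [1.0, 0.0],   # 0: the image that matches the query
    [0.9, 0.1],   # 1
    [0.0, 1.0],   # 2
    [0.7, 0.7],   # 3
    [-1.0, 0.0],  # 4
    [0.8, -0.2],  # 5
]
query_vec = [1.0, 0.0]

hard_negatives = mine_hard_negatives(query_vec, image_vecs, positive_idx=0)
print(hard_negatives)  # [1, 5, 3, 2]
```

Image 4 points in the opposite direction from the query, so it is the easiest negative and is the one left out when k is 4.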

Provided by:

Omartificial-Intelligence-Space
