whybe-choi/en-vdr-hn

Name: whybe-choi/en-vdr-hn
Creator: whybe-choi
Published: 2026-04-26 12:00:18
License: No description available

Source: Hugging Face. Updated 2026-04-26. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/whybe-choi/en-vdr-hn

Overview:

English Visual Document Retrieval (VDR) Hard Negatives is a multimodal retrieval training set for fine-tuning visual document retrieval embedding models on English document pages. Queries are text, documents are page images, and each row in a mining config contains 1 positive and 7 mined hard negatives. Hard negatives were mined with Qwen/Qwen3-VL-Embedding-8B. Mining was performed separately within each source dataset, and any page that is a positive for the same query in that source dataset was excluded from the negative candidates.

The dataset includes five configs: corpus (deduplicated image store, one row per unique page image), naive (the top seven valid retrieved negatives), shifted_by_n (slightly easier negatives obtained by skipping the top N candidates, with N=5), marginpos (negatives filtered by an absolute margin below the positive score, with margin=0.05), and percpos (negatives filtered by a percentage of the positive score, with threshold=95%).

The data comes from five English VDR training sources: VisRAG-Ret-Train-In-domain-data, vdr-multilingual-train (en), REAL-MM-RAG_FinTabTrainSet_rephrased, colpali_train_set, and tatdqa_train.
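The four mining configs differ only in how negatives are chosen from the ranked candidate list, which a small sketch can make concrete. The code below is an illustration under assumptions, not the dataset author's mining script: the function name `select_negatives`, the `(page_id, score)` candidate format, the toy scores, and the exact comparisons (`score <= pos_score - margin` for marginpos, `score <= pos_score * perc` for percpos) are all hypothetical. Only the strategy names and the values K=7, N=5, margin=0.05, and 95% come from the description.

```python
from typing import List, Tuple

# Hypothetical candidate format: (page_id, similarity score to the query).
Candidate = Tuple[str, float]


def select_negatives(
    candidates: List[Candidate],
    pos_score: float,
    strategy: str,
    k: int = 7,
    shift_n: int = 5,
    margin: float = 0.05,
    perc: float = 0.95,
) -> List[str]:
    """Return up to k negative page ids for one query.

    Assumes `candidates` already excludes every positive for the query.
    The threshold comparisons are an assumed reading of the config names.
    """
    # Rank candidates from most to least similar to the query.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)

    if strategy == "naive":
        kept = ranked  # take the top of the ranking as-is
    elif strategy == "shifted_by_n":
        kept = ranked[shift_n:]  # skip the N hardest candidates
    elif strategy == "marginpos":
        # Keep candidates at least `margin` below the positive score.
        kept = [c for c in ranked if c[1] <= pos_score - margin]
    elif strategy == "percpos":
        # Keep candidates at or below `perc` times the positive score.
        kept = [c for c in ranked if c[1] <= pos_score * perc]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [page_id for page_id, _ in kept[:k]]


# Toy data (made up for illustration): 12 candidates and a positive score of 0.80.
pos_score = 0.80
scores = [0.79, 0.755, 0.74, 0.70, 0.66, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30]
candidates = [(f"p{i + 1}", s) for i, s in enumerate(scores)]

for name in ("naive", "shifted_by_n", "marginpos", "percpos"):
    print(name, select_negatives(candidates, pos_score, name))
```

On this toy input, naive keeps the seven highest-scoring candidates, shifted_by_n starts at the sixth, and the two score-based filters drop candidates that score too close to the positive, which is the usual motivation for such filters: near-positive candidates are more likely to be unlabeled true matches.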

Provider:

whybe-choi
