whybe-choi/en-vdr-hn
Source: Hugging Face. Updated 2026-04-26. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/whybe-choi/en-vdr-hn
Description:
English Visual Document Retrieval (VDR) Hard Negatives is a multimodal retrieval training set for fine-tuning visual-document retrieval embedding models on English document pages. Queries are text, documents are page images, and each row in the mining configs contains 1 positive and 7 mined hard negatives. The hard negatives were mined with Qwen/Qwen3-VL-Embedding-8B. Mining was performed within each source dataset, and pages that are positives for the same query within the same source dataset were excluded from the negative candidates. The dataset has five configs: corpus (deduplicated image store, one row per unique page image), naive (the top seven valid retrieved negatives), shifted_by_n (slightly easier negatives, obtained by skipping the top N candidates, with N=5), marginpos (negatives filtered by an absolute margin relative to the positive score, with margin=0.05), and percpos (negatives filtered by a percentage of the positive score, with threshold=95%). The data comes from five English VDR training sources: VisRAG-Ret-Train-In-domain-data, vdr-multilingual-train (en), REAL-MM-RAG_FinTabTrainSet_rephrased, colpali_train_set, and tatdqa_train.
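The four mining configs differ only in how negatives are picked from a ranked candidate list, which the following minimal sketch tries to illustrate. It is not the dataset author's code. The function name `select_negatives`, the `(page_id, score)` candidate format, and the exact threshold rules (keep a candidate if its score is at most `pos_score - margin` for marginpos, or at most `pos_score * perc` for percpos) are assumptions based on common positive-aware mining conventions. The candidate list is assumed to already exclude positives for the same query.

```python
from typing import List, Tuple

# Hypothetical candidate format: (page_id, similarity score to the query).
Candidate = Tuple[str, float]


def select_negatives(
    candidates: List[Candidate],
    pos_score: float,
    strategy: str,
    k: int = 7,          # negatives per row, as stated in the description
    n: int = 5,          # shifted_by_n: number of top candidates to skip
    margin: float = 0.05,  # marginpos: absolute margin below the positive score
    perc: float = 0.95,    # percpos: fraction of the positive score
) -> List[str]:
    """Return up to k negative page ids under the given strategy (assumed semantics)."""
    # Rank candidates from most to least similar to the query.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)

    if strategy == "naive":
        kept = ranked
    elif strategy == "shifted_by_n":
        kept = ranked[n:]
    elif strategy == "marginpos":
        # Assumption: drop candidates scoring within `margin` of the positive.
        kept = [c for c in ranked if c[1] <= pos_score - margin]
    elif strategy == "percpos":
        # Assumption: drop candidates scoring above `perc` of the positive score.
        kept = [c for c in ranked if c[1] <= pos_score * perc]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [page_id for page_id, _ in kept[:k]]
```

Under these assumptions, naive keeps the hardest candidates, shifted_by_n trades some hardness for a lower risk of false negatives, and the two positive-aware variants discard candidates whose scores are too close to (or above) the positive's score.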
Provided by:
whybe-choi



