VDR_MEGA_MultiDomain_DocRetrieval

Name: VDR_MEGA_MultiDomain_DocRetrieval
Creator: maas
Published: 2025-12-05 11:58:17
License: No description available

Source: ModelScope community. Updated 2025-12-05, indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/racineai/VDR_MEGA_MultiDomain_DocRetrieval

Resource description:

# Visual Document Retrieval Dataset

## Overview

This dataset is designed for training visual document retrieval models. It combines multiple datasets from the VDR series, ColPali, and LlamaIndex to create the most comprehensive training resource for visual document retrieval tasks.

## Dataset Structure

Each entry contains the following structured fields:

- A unique identifier: a string of 45 to 50 characters.
- Search query text: variable length, between 5 and 336 characters.
- A language classification with 5 distinct values.
- The number of negative examples: an integer from 0 to 16.
- A primary document image, with widths spanning 366 to 5310 pixels.
- Additional negative example images in the fields negative_image_0 to negative_image_15, with widths between 622 and 827 pixels.

## Language Distribution

The dataset covers five languages, with approximately 1,090,000 examples in total:

| Language | Examples | Percentage |
|----------|----------|------------|
| English (en) | ~700,770 | 64.3% |
| French (fr) | ~224,540 | 20.6% |
| German (de) | ~56,680 | 5.2% |
| Spanish (es) | ~56,680 | 5.2% |
| Italian (it) | ~52,320 | 4.8% |
| **Total** | **~1,090,000** | **100%** |

## Purpose

This dataset serves as a training resource for visual document retrieval models, providing both positive and negative examples to improve model discrimination. It includes examples with and without negative samples, so models can learn from diverse training scenarios. The multilingual composition supports robust performance across languages and document types, and the extensive negative sampling supports the contrastive learning approaches commonly used to train visual document retrieval models.
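The negative-image layout described above can be turned into contrastive training items with a few lines of Python. The following is a minimal sketch, not the authors' pipeline. Only the `negative_image_0` to `negative_image_15` field names come from the card; the `query` and `image` keys and the helper names are assumptions made for illustration, and unused negative slots are assumed to be `None` or absent.

```python
from typing import Any, Dict, List

# The card lists negative_image_0 .. negative_image_15, i.e. at most 16 slots.
MAX_NEGATIVES = 16


def collect_negatives(record: Dict[str, Any]) -> List[Any]:
    """Return the populated negative images of one record, in field order.

    Assumption: an unused slot is either missing or set to None.
    """
    negatives = []
    for i in range(MAX_NEGATIVES):
        image = record.get(f"negative_image_{i}")
        if image is not None:
            negatives.append(image)
    return negatives


def build_training_item(record: Dict[str, Any]) -> Dict[str, Any]:
    """Group one record into (query, positive, negatives) for contrastive training.

    The "query" and "image" keys are hypothetical names for the search query
    text and the primary document image; adjust them to the real schema.
    """
    negatives = collect_negatives(record)
    return {
        "query": record["query"],
        "positive": record["image"],
        "negatives": negatives,
        # Records without explicit negatives can still be trained with
        # in-batch negatives, so they are flagged rather than dropped.
        "has_negatives": len(negatives) > 0,
    }
```

Flagging records instead of filtering them keeps both kinds of example available, which matches the card's note that the dataset includes entries with and without negative samples.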
## Data Sources

This dataset represents a strategic fusion of established datasets from multiple sources:

### VDR Series

- [racineai/VDR_Military](https://huggingface.co/datasets/racineai/VDR_Military) - Military domain documents (187k examples)
- [racineai/VDR_Energy](https://huggingface.co/datasets/racineai/VDR_Energy) - Energy sector documents (160k examples)
- [racineai/VDR_Geotechnie](https://huggingface.co/datasets/racineai/VDR_Geotechnie) - Geotechnical engineering documents (68.3k examples)
- [racineai/VDR_Hydrogen](https://huggingface.co/datasets/racineai/VDR_Hydrogen) - Hydrogen technology documents

### Visual Document Retrieval

- [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set) - Core training examples for visual document understanding
- [openbmb/VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data) - Synthetic visual retrieval training data
- [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) - Multilingual training dataset for visual document retrieval

This consolidation creates the largest and most complete dataset currently available for visual document retrieval model training, combining the strengths and coverage of each contributing source to maximize training effectiveness.

License: This dataset is released under the Apache 2.0 License.

Provider:

maas

Created:

2025-11-21
