colpali_train_set

Name: colpali_train_set
Creator: maas
Published: 2025-12-18 16:36:58
License: No description available

ModelScope community: updated 2025-12-18, indexed 2024-09-07

Download link:

https://modelscope.cn/datasets/vidore/colpali_train_set


Resource description:

## Dataset Description

This dataset is the training set of [ColPali](https://huggingface.co/vidore/colpali). It includes 127,460 query-image pairs drawn from openly available academic datasets (63%) and from a synthetic dataset (37%) made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions. Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages.

| Dataset | #examples (query-page pairs) | Language |
|------------------------------------------|-------------------------------|----------|
| [DocVQA](https://www.docvqa.org/datasets/docvqa) | 39,463 | English |
| [InfoVQA](https://www.docvqa.org/datasets/infographicvqa) | 10,074 | English |
| [TATDQA](https://github.com/NExTplusplus/TAT-DQA) | 13,251 | English |
| [arXivQA](https://huggingface.co/datasets/MMInstruction/ArxivQA) | 10,000 | English |
| Scraped documents covering a wide array of topics | 45,940 | English |
| **TOTAL** | **118,695** | **English-only** |

### Data Curation

We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination.

### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("vidore/colpali_train_set", split="train")
```

### Dataset Structure

Here is an example of a dataset instance structure:

```yaml
features:
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: query
    dtype: string
  - name: answer
    dtype: string
  - name: source
    dtype: string
  - name: options
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: answer_type
    dtype: string
```

## License

All academic datasets used are redistributed here in subsampled form, under their original licenses. The synthetic datasets we created with public internet data and VLM synthetic queries are released without usage restrictions.
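The loading call and feature list given earlier on this card can be exercised with a few small helpers. This is a minimal sketch, not part of the original card: it assumes the `datasets` library is installed and the Hugging Face Hub is reachable, and the names `EXPECTED_COLUMNS`, `missing_columns`, `count_by_source` and `peek` are illustrative. It streams a handful of rows rather than downloading the full set, so that the columns can be checked against the ten listed features.

```python
from collections import Counter
from itertools import islice

# Column names copied from the feature list on this card.
EXPECTED_COLUMNS = [
    "image", "image_filename", "query", "answer", "source",
    "options", "page", "model", "prompt", "answer_type",
]


def missing_columns(row):
    """Return the expected column names that are absent from one row (a dict)."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


def count_by_source(rows):
    """Tally rows by their `source` field; rows without one are counted under None."""
    return Counter(row.get("source") for row in rows)


def peek(n=100):
    """Stream the first n training rows (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    stream = load_dataset("vidore/colpali_train_set", split="train", streaming=True)
    return list(islice(stream, n))
```

For example, `rows = peek(100)` followed by `missing_columns(rows[0])` and `count_by_source(rows)` gives a quick smoke test. The stream is read in stored order, so the first rows may all come from a single source and the tally should not be read as the overall composition.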
## Citation Information

If you use this dataset in your research, please cite the original dataset as follows:

```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
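As a quick arithmetic check on the composition table in the Dataset Description, the per-source counts can be tallied directly. The sketch below copies the counts from that table. The variable names are illustrative, and treating the four named datasets as the academic share and the scraped documents as the synthetic share is an assumption based on the prose. As listed, the rows sum to 118,728, which matches neither the table's TOTAL (118,695) nor the 127,460 quoted in the text. The card does not explain the gap, and the snippet makes no claim about which figure is authoritative.

```python
# Counts copied from the composition table on this card.
ACADEMIC = {"DocVQA": 39_463, "InfoVQA": 10_074, "TATDQA": 13_251, "arXivQA": 10_000}
SYNTHETIC = {"Scraped documents": 45_940}

# Totals as stated on the card (table footer and prose, respectively).
STATED_TABLE_TOTAL = 118_695
STATED_PROSE_TOTAL = 127_460


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


academic = sum(ACADEMIC.values())
synthetic = sum(SYNTHETIC.values())
row_sum = academic + synthetic

print(f"sum of table rows:   {row_sum:,}")
print(f"stated table total:  {STATED_TABLE_TOTAL:,}")
print(f"stated prose total:  {STATED_PROSE_TOTAL:,}")
print(f"academic / synthetic share of row sum: "
      f"{share(academic, row_sum)}% / {share(synthetic, row_sum)}%")
```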


Provider:

maas

Created:

2025-06-04
