colpali_train_set
Source: ModelScope community. Updated 2025-12-18; indexed 2024-09-07.
Download link:
https://modelscope.cn/datasets/vidore/colpali_train_set
## Dataset Description
This dataset is the training set of [ColPali](https://huggingface.co/vidore/colpali). It includes 127,460 query-image pairs drawn from openly available academic datasets (63%) and from a synthetic dataset (37%) made up
of pages from web-crawled PDF documents, augmented with pseudo-questions generated by a VLM (Claude-3 Sonnet).
Our training set is fully English by design, which enables us to study zero-shot generalization to non-English languages.
| Dataset | #examples (query-page pairs) | Language |
|------------------------------------------|-------------------------------|----------|
| [DocVQA](https://www.docvqa.org/datasets/docvqa) | 39,463 | English |
| [InfoVQA](https://www.docvqa.org/datasets/infographicvqa) | 10,074 | English |
| [TATDQA](https://github.com/NExTplusplus/TAT-DQA) | 13,251 | English |
| [arXivQA](https://huggingface.co/datasets/MMInstruction/ArxivQA) | 10,000 | English |
| Scraped documents covering a wide array of topics | 45,940 | English |
| **TOTAL** | **118,695** | **English-only** |
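As a quick sanity check on the table, the per-source counts can be summed and the share of scraped (synthetic) pairs computed. This is a minimal sketch that uses only the row values listed above; the dictionary keys are illustrative labels, not identifiers used by the dataset.

```python
# Row counts copied from the table above (labels are illustrative only).
counts = {
    "DocVQA": 39_463,
    "InfoVQA": 10_074,
    "TATDQA": 13_251,
    "arXivQA": 10_000,
    "scraped": 45_940,
}

row_sum = sum(counts.values())
synthetic_share = counts["scraped"] / row_sum

print(f"sum of rows: {row_sum:,}")                 # sum of rows: 118,728
print(f"synthetic share: {synthetic_share:.1%}")   # synthetic share: 38.7%
```

The rows as listed sum to 118,728, which matches neither the stated total nor the 127,460 figure in the description exactly. The sketch only reproduces the arithmetic and does not indicate which number is authoritative.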
### Data Curation
We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination.
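The card does not describe how this verification was carried out. One simple way to express such a check, assuming every page can be mapped back to an identifier of its source document (an assumption; the identifiers below are hypothetical), is a set intersection between the two collections:

```python
def shared_documents(train_doc_ids, eval_doc_ids):
    """Return the document IDs that appear in both collections, sorted."""
    return sorted(set(train_doc_ids) & set(eval_doc_ids))


# Hypothetical document identifiers, for illustration only.
train_docs = ["report_a.pdf", "paper_b.pdf", "manual_c.pdf"]
eval_docs = ["paper_x.pdf", "manual_c.pdf"]

overlap = shared_documents(train_docs, eval_docs)
print(overlap)  # ['manual_c.pdf']
```

An empty result would indicate no shared documents under the chosen identifier scheme. Exact-match identifiers would not catch renamed or re-crawled copies of the same PDF.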
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("vidore/colpali_train_set", split="train")
```
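Once the data is loaded, a common first step is to look at how the rows are distributed across the `source` column. The sketch below works on any iterable of dict-like rows and is shown on toy data, because the actual values stored in `source` are not documented in this card and the ones used here are made up.

```python
from collections import Counter


def count_by_source(rows):
    """Count rows per value of the `source` column."""
    return Counter(row["source"] for row in rows)


# Toy rows standing in for dataset records; the source values are made up.
toy_rows = [
    {"query": "What is the total?", "source": "source_a"},
    {"query": "Who signed the letter?", "source": "source_b"},
    {"query": "Which year is shown?", "source": "source_a"},
]

source_counts = count_by_source(toy_rows)
print(source_counts.most_common())

# With the real dataset, passing ds.select_columns(["source"]) instead of
# ds should avoid decoding every image while iterating.
```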
### Dataset Structure
Here is an example of a dataset instance structure:
```yaml
features:
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: query
    dtype: string
  - name: answer
    dtype: string
  - name: source
    dtype: string
  - name: options
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: answer_type
    dtype: string
```
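When writing preprocessing code against this schema, it can help to check that a record carries all ten listed columns before using it. The following is a minimal sketch; the helper name `missing_columns` and the toy record values are hypothetical, and only the column names come from the listing above.

```python
# Column names taken from the feature listing above.
EXPECTED_COLUMNS = (
    "image", "image_filename", "query", "answer", "source",
    "options", "page", "model", "prompt", "answer_type",
)


def missing_columns(record):
    """Return the expected column names that are absent from `record`."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


# Toy record with placeholder values; a real row would hold a decoded image.
toy_record = {name: "" for name in EXPECTED_COLUMNS}
toy_record["image"] = None

print(missing_columns(toy_record))        # []
print(missing_columns({"query": "q"}))    # the nine other column names
```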
## License
All academic datasets used are redistributed here in subsampled form, under their original licenses.
The synthetic datasets we created from public internet data and VLM-generated queries are released without usage restrictions.
## Citation Information
If you use this dataset in your research, please cite the original dataset as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
```
Provider: maas
Created: 2025-06-04



