shiftproject_test
Source: ModelScope community. Updated 2025-12-05; indexed 2025-06-07.
Download link:
https://modelscope.cn/datasets/vidore/shiftproject_test
## Dataset Description
This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications.
It includes French documents from the [Shift Project](https://theshiftproject.org/) about the **environment**.
Having a dataset in French allows *ViDoRe* to evaluate the multilingual ability of a retrieval model.
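To make the evaluation setting concrete, here is a minimal sketch of one common retrieval metric, recall@k, computed from query-to-page similarity scores. The card does not state which metric the benchmark reports, so the choice of recall@k is an assumption. The function names and the toy scores are hypothetical and are not produced by any model.

```python
def recall_at_k(scores, gold_index, k):
    """Return 1.0 if the gold page is among the k highest-scoring pages, else 0.0."""
    # Rank page indices from highest to lowest similarity score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if gold_index in ranked[:k] else 0.0


def mean_recall_at_k(score_rows, gold_indices, k):
    """Average recall@k over all queries."""
    hits = [recall_at_k(s, g, k) for s, g in zip(score_rows, gold_indices)]
    return sum(hits) / len(hits)


# Toy example: 3 queries scored against 3 pages (made-up numbers).
score_rows = [
    [0.1, 0.9, 0.3],  # query 0, gold page is index 1
    [0.8, 0.2, 0.5],  # query 1, gold page is index 2
    [0.4, 0.6, 0.7],  # query 2, gold page is index 0
]
gold_indices = [1, 2, 0]

print(mean_recall_at_k(score_rows, gold_indices, k=1))
print(mean_recall_at_k(score_rows, gold_indices, k=2))
```

In practice the score rows would come from a retrieval model that compares each query against every page image in the corpus.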
### Data Collection
We collected 5 large documents from the Shift Project reports, totalling 1,000 document pages per topic. We associated these with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.
### Data Curation
As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("vidore/shiftproject_test", split="test")
```
### Dataset Structure
Here is an example of a dataset instance structure:
```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```
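As a rough illustration of how these fields might be used, the sketch below checks that a record carries the eight features listed above and builds a short text summary from it. The names `EXPECTED_FIELDS`, `missing_fields`, and `describe` are hypothetical helpers, not part of the dataset or of the `datasets` library, and the sample record is made up for demonstration.

```python
# Field names taken from the feature list above.
EXPECTED_FIELDS = (
    "query", "image", "image_filename", "answer",
    "page", "model", "prompt", "source",
)


def missing_fields(record):
    """Return the expected field names that are absent from `record`."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def describe(record):
    """Build a one-line summary from the text fields of a record."""
    return f"{record['image_filename']} (page {record['page']}): {record['query']}"


# Made-up record for illustration only; real rows come from load_dataset,
# where the "image" field holds the page image rather than None.
sample = {
    "query": "Quelle est la part du transport dans les émissions ?",
    "image": None,
    "image_filename": "example_report_page_12.jpg",
    "answer": "placeholder answer",
    "page": "12",
    "model": "placeholder-model",
    "prompt": "",
    "source": "pdf",
}

print(missing_fields(sample))
print(describe(sample))
```

The same helpers could be applied to a row of the loaded split, for example `describe(ds[0])`, assuming `ds` was loaded as shown in the previous section.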
## Citation Information
If you use this dataset in your research, please cite the original dataset as follows:
```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
```
Provider: maas
Created: 2025-06-04



