
shiftproject_test

ModelScope community. Updated 2025-12-05; indexed 2025-06-07.
Download link:
https://modelscope.cn/datasets/vidore/shiftproject_test
## Dataset Description

This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications. It includes French documents from the [Shift Project](https://theshiftproject.org/) about the **environment**. Having a dataset in French allows *ViDoRe* to evaluate the multilingual ability of a retrieval model.

### Data Collection

We collected 5 large documents from the Shift Project reports, totalling 1,000 document pages per topic. We associated these with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.

### Data Curation

Because the queries (and answers) were generated by a vision-language model, human annotators extensively filtered them for quality and relevance.

### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("vidore/shiftproject_test", split="test")
```

### Dataset Structure

Here is an example of a dataset instance structure:

```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```

## Citation Information

If you use this dataset in your research, please cite the original dataset as follows:

```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
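Returning to the fields listed under Dataset Structure, the sketch below shows one way the records might be checked and grouped once loaded. It is only an illustration: the helper names `missing_fields` and `queries_by_page` are hypothetical, the sample rows are made up rather than taken from the dataset, and it assumes each record pairs one query with the page it was generated from.

```python
from collections import defaultdict

# Field names as listed in the dataset card's feature schema.
EXPECTED_FIELDS = (
    "query", "image", "image_filename", "answer",
    "page", "model", "prompt", "source",
)


def missing_fields(record):
    """Return the schema fields that are absent from a record (hypothetical helper)."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def queries_by_page(records):
    """Group query strings by (image_filename, page) (hypothetical helper).

    Assumes each record links a single query to a single document page.
    """
    grouped = defaultdict(list)
    for record in records:
        key = (record["image_filename"], record["page"])
        grouped[key].append(record["query"])
    return dict(grouped)


# Made-up rows for illustration only; real rows would come from
# load_dataset("vidore/shiftproject_test", split="test").
sample_rows = [
    {"query": "q1", "image": None, "image_filename": "report_a.pdf",
     "answer": "a1", "page": "3", "model": "m", "prompt": "p", "source": "s"},
    {"query": "q2", "image": None, "image_filename": "report_a.pdf",
     "answer": "a2", "page": "3", "model": "m", "prompt": "p", "source": "s"},
    {"query": "q3", "image": None, "image_filename": "report_b.pdf",
     "answer": "a3", "page": "10", "model": "m", "prompt": "p", "source": "s"},
]

grouped = queries_by_page(sample_rows)
```

Grouping by page in this way gives a simple query-to-page mapping, which is the kind of relevance information a retrieval evaluation would need; whether `image_filename` and `page` together identify a page uniquely in the real data is an assumption worth verifying after loading.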

Provider: maas
Created: 2025-06-04