
ViDoSeek

Source: ModelScope community. Updated 2026-05-12; indexed 2026-05-03.
Download link: https://modelscope.cn/datasets/iic/ViDoSeek
Description:
## 🚀Overview

This is the repository for ViDoSeek, a benchmark designed for retrieval, reasoning, and question answering over visually rich documents. It is suited to evaluating RAG over a large document corpus.

- The paper is available at [https://arxiv.org/abs/2502.18017](https://arxiv.org/abs/2502.18017).
- ViDoRAG Project: [https://github.com/Alibaba-NLP/ViDoRAG](https://github.com/Alibaba-NLP/ViDoRAG)

**ViDoSeek** sets itself apart through its higher difficulty, which comes from the multi-document context and the intricate nature of its content types, particularly the Layout category. The dataset contains both single-hop and multi-hop queries, presenting a diverse set of challenges.

We have also released the **SlideVQA-Refined** dataset, which is refined through our pipeline. It is likewise suitable for evaluating retrieval-augmented generation tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/657429d833e5a4bf5b278615/dPq5bf1P2vA0VZ50XKeXz.jpeg" style="width: 55%; height: auto;" alt="ViDoSeek">

## 🔍Dataset Format

The annotation is in the form of a JSON file.
```json
{
    "uid": "04d8bb0db929110f204723c56e5386c1d8d21587_2", // Unique identifier to distinguish different queries
    "query": "What is the temperature of Steam explosion of Pretreatment for Switchgrass and Sugarcane bagasse preparation?", // Query content
    "reference_answer": "195-205 Centigrade", // Reference answer to the query
    "meta_info": {
        "file_name": "Pretreatment_of_Switchgrass.pdf", // Original file name, typically a PDF file
        "reference_page": [10, 11], // Reference page numbers represented as an array
        "source_type": "Text", // Type of data source: one of 2d_layout, Text, Table, Chart
        "query_type": "Multi-Hop" // Query type: Multi-Hop or Single-Hop
    }
}
```

## 📚 Download and Pre-Process

To use ViDoSeek, you need to download the document files `vidoseek_pdf_document.zip` and the query annotations `vidoseek.json`.

Optionally, you can use the code we provide to process the dataset and perform inference. The processing code is available at [https://github.com/Alibaba-NLP/ViDoRAG/tree/main/scripts](https://github.com/Alibaba-NLP/ViDoRAG/tree/main/scripts).

## 📝 Citation

If you find this dataset useful, please consider citing our paper:

```bibtex
@article{wang2025vidorag,
  title={ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents},
  author={Wang, Qiuchen and Ding, Ruixue and Chen, Zehui and Wu, Weiqi and Wang, Shihang and Xie, Pengjun and Zhao, Feng},
  journal={arXiv preprint arXiv:2502.18017},
  year={2025}
}
```
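As a rough illustration of the annotation format described under Dataset Format, the following Python sketch validates records and groups them by `query_type` or `source_type`. It is not part of the official ViDoRAG scripts. The helper names (`load_records`, `validate_record`, `count_by`, `filter_records`) are hypothetical, and the top-level layout of `vidoseek.json` is an assumption: the README only shows a single record, so the loader accepts either a bare list of records or an object wrapping exactly one list.

```python
import json
from collections import Counter
from pathlib import Path

# Record copied from the format example in the README (comments removed).
SAMPLE_RECORD = {
    "uid": "04d8bb0db929110f204723c56e5386c1d8d21587_2",
    "query": "What is the temperature of Steam explosion of Pretreatment "
             "for Switchgrass and Sugarcane bagasse preparation?",
    "reference_answer": "195-205 Centigrade",
    "meta_info": {
        "file_name": "Pretreatment_of_Switchgrass.pdf",
        "reference_page": [10, 11],
        "source_type": "Text",
        "query_type": "Multi-Hop",
    },
}

TOP_LEVEL_KEYS = ("uid", "query", "reference_answer", "meta_info")
META_KEYS = ("file_name", "reference_page", "source_type", "query_type")


def load_records(path):
    """Load annotation records from a JSON file.

    Assumption (not stated in the README): the file is either a list of
    records or an object that wraps exactly one list of records.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        lists = [value for value in data.values() if isinstance(value, list)]
        if len(lists) == 1:
            return lists[0]
    raise ValueError("unrecognized annotation layout")


def validate_record(record):
    """Raise ValueError if a field shown in the README example is missing."""
    for key in TOP_LEVEL_KEYS:
        if key not in record:
            raise ValueError(f"missing field: {key}")
    for key in META_KEYS:
        if key not in record["meta_info"]:
            raise ValueError(f"missing meta_info field: {key}")


def count_by(records, meta_key):
    """Count records per value of a meta_info field, e.g. 'query_type'."""
    return Counter(record["meta_info"][meta_key] for record in records)


def filter_records(records, query_type=None, source_type=None):
    """Keep records matching the given query_type and/or source_type."""
    kept = []
    for record in records:
        meta = record["meta_info"]
        if query_type is not None and meta["query_type"] != query_type:
            continue
        if source_type is not None and meta["source_type"] != source_type:
            continue
        kept.append(record)
    return kept
```

With the downloaded annotations in place, `count_by(load_records("vidoseek.json"), "source_type")` would give a quick view of how queries are spread across the source types, provided the file matches one of the assumed layouts.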

Provider: maas
Created: 2026-04-09