DocVQA-2026

Name: DocVQA-2026
Creator: maas
Published: 2026-05-16 22:50:23
License: No description available

ModelScope community. Updated: 2026-05-16. Indexed: 2026-03-07.

Download link:

https://modelscope.cn/datasets/VLR-CVC/DocVQA-2026


Resource description:

<p align="center">
  <img src="./assets/banner.png" alt="DocVQA 2026 Competition Banner" width="100%">
</p>

<h1 align="center">DocVQA 2026 | ICDAR2026 Competition on Multimodal Reasoning over Documents in Multiple Domains</h1>

<p align="center">
  <a href="https://www.docvqa.org/challenges/2026">
    <img src="https://img.shields.io/badge/🌐_Website-DocVQA.org-orange.svg" alt="Competition Website">
  </a>
  <a href="https://huggingface.co/datasets/VLR-CVC/DocVQA-2026">
    <img src="https://img.shields.io/badge/🤗_Hugging_Face-Dataset-blue.svg" alt="Hugging Face Dataset">
  </a>
  <a href="https://github.com/VLR-CVC/DocVQA2026">
    <img src="https://img.shields.io/badge/GitHub-Eval_Code-black.svg?logo=github&logoColor=white" alt="GitHub Repository">
  </a>
  <a href="https://rrc.cvc.uab.es/?ch=34">
    <img src="https://img.shields.io/badge/RRC-Competition_Platform-green.svg" alt="RRC Competition Platform">
  </a>
</p>

Building upon previous DocVQA benchmarks, this evaluation dataset introduces challenging reasoning questions over a diverse collection of documents spanning eight domains, including business reports, scientific papers, slides, posters, maps, comics, infographics, and engineering drawings. By expanding coverage to new document domains and introducing richer question types, this benchmark seeks to push the boundaries of multimodal reasoning and promote the development of more general, robust document understanding models.

## 🏆 Competition Hosting & Datasets

The official DocVQA 2026 competition is hosted on the **Robust Reading Competition (RRC)** platform, which provides the standardized framework for our leaderboards, submissions, and result tracking.
<p align="center">
  <a href="https://rrc.cvc.uab.es/?ch=34" style="background-color: #007bff; color: white; padding: 12px 24px; text-decoration: none; border-radius: 6px; font-weight: bold; font-size: 18px; display: inline-block;">
    Join the Challenge on the RRC Platform
  </a>
</p>

The benchmark includes:

- **Validation set:** contains public answers and is intended for local development and experimentation. It can be evaluated locally using the official evaluation code or online via the RRC platform.
- **Test set:** contains **private answers** and is used for the official competition ranking. It can only be evaluated through the official RRC platform.

## 📋 Participation Requirements

To participate in the competition:

1. A method must be submitted on the **test set by April 3, 2026** on the RRC platform.
2. A **one- or two-page report** must be submitted by email to **docvqa@cvc.uab.cat** by **April 17, 2026**.

These reports will be included in the competition publication in the proceedings of the **International Conference on Document Analysis and Recognition (ICDAR)**, held in **Vienna, Austria**.

## 📊 Competition Categories

There are **three participation categories**, depending on the total number of parameters of the submitted method. This count must include all parameters, whether active or not, and all parameters across all models used in agentic systems.

Categories:

- **Up to 8B parameters**
- **Over 8B parameters and up to 35B**
- **Over 35B parameters**

## Load & Inspect the Data

```python
from datasets import load_dataset
from PIL import Image

# Disable PIL's pixel-count limit so the largest images in the dataset can be loaded
Image.MAX_IMAGE_PIXELS = None

# 1. Load the dataset
dataset = load_dataset("VLR-CVC/DocVQA-2026", split="val")

# 2. Access a single sample (one document)
sample = dataset[0]
doc_id = sample["doc_id"]
category = sample["doc_category"]
print(f"Document ID: {doc_id} ({category})")

# 3. Access Images
# 'document' is a list of PIL Images (one for each page)
images = sample["document"]
print(f"Number of pages: {len(images)}")
images[0].show()

# 4. Access Questions and Answers
questions = sample["questions"]
answers = sample["answers"]

# 5. Visualize Q&A pairs for a document
for q, q_id, a in zip(questions['question'], questions['question_id'], answers['answer']):
    print("-" * 50)
    print(f"Question ID: {q_id}")
    print(f"Question: {q}")
    print(f"Answer: {a}")
print("-" * 50)
```

## Structure of a Sample

<details>
<summary><b>Click to expand the JSON structure</b></summary>

```json
{
  "doc_id": "maps_2",
  "doc_category": "maps",
  "preview": "<image>",
  "document": [
    "<image>"
  ],
  "questions": {
    "question_id": [
      "maps_2_q1",
      "maps_2_q2",
      "maps_2_q3",
      "maps_2_q4",
      "maps_2_q5"
    ],
    "question": [
      "By which kind of road are Colchester and Yantic connected?",
      "Which is the most populated town in the E-10 coordinates?",
      "What is the milage between Taunton and Dedham? Do not provide the unit.",
      "From Worcester I take highway 140 towards Taunton, I take the second macadam & gravel road that I encounter, continuing on that road, what town do I reach?",
      "If I follow highway 109 from Pittsfield to Northampton, how many towns do I cross (without counting start and ending location)?"
    ]
  },
  "answers": {
    "question_id": [
      "maps_2_q1",
      "maps_2_q2",
      "maps_2_q3",
      "maps_2_q4",
      "maps_2_q5"
    ],
    "answer": [
      "Macadam & Gravel",
      "Wareham",
      "27",
      "Woonsocket",
      "7"
    ]
  }
}
```

</details>

## Results

<p align="center">
  <img src="./assets/results_chart.jpg" alt="DocVQA 2026 Results Chart" width="80%">
  <br>
  <em>Figure 1: Performance comparison across domains.</em>
</p>

<div align="center">
<table>
  <thead>
    <tr>
      <th align="left">Category</th>
      <th align="center">Gemini 3 Pro Preview</th>
      <th align="center">GPT-5.2</th>
      <th align="center">Gemini 3 Flash Preview</th>
      <th align="center">GPT-5 Mini</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="left"><b>Overall Accuracy</b></td>
      <td align="center"><b>0.375</b></td>
      <td align="center">0.350</td>
      <td align="center">0.3375</td>
      <td align="center">0.225</td>
    </tr>
    <tr>
      <td align="left">Business Report</td>
      <td align="center">0.400</td>
      <td align="center"><b>0.600</b></td>
      <td align="center">0.200</td>
      <td align="center">0.300</td>
    </tr>
    <tr>
      <td align="left">Comics</td>
      <td align="center">0.300</td>
      <td align="center">0.200</td>
      <td align="center"><b>0.400</b></td>
      <td align="center">0.100</td>
    </tr>
    <tr>
      <td align="left">Engineering Drawing</td>
      <td align="center">0.300</td>
      <td align="center">0.300</td>
      <td align="center"><b>0.500</b></td>
      <td align="center">0.200</td>
    </tr>
    <tr>
      <td align="left">Infographics</td>
      <td align="center"><b>0.700</b></td>
      <td align="center">0.600</td>
      <td align="center">0.500</td>
      <td align="center">0.500</td>
    </tr>
    <tr>
      <td align="left">Maps</td>
      <td align="center">0.000</td>
      <td align="center"><b>0.200</b></td>
      <td align="center">0.000</td>
      <td align="center">0.100</td>
    </tr>
    <tr>
      <td align="left">Science Paper</td>
      <td align="center">0.300</td>
      <td align="center">0.400</td>
      <td align="center"><b>0.500</b></td>
      <td align="center">0.100</td>
    </tr>
    <tr>
      <td align="left">Science Poster</td>
      <td align="center"><b>0.300</b></td>
      <td align="center">0.000</td>
      <td align="center">0.200</td>
      <td align="center">0.000</td>
    </tr>
    <tr>
      <td align="left">Slide</td>
      <td align="center"><b>0.700</b></td>
      <td align="center">0.500</td>
      <td align="center">0.400</td>
      <td align="center">0.500</td>
    </tr>
  </tbody>
</table>
</div>

> [!NOTE]
> **Evaluation Parameters:**
> * **GPT Models:** "High thinking" enabled, temperature set to `1.0`.
> * **Gemini Models:** "High thinking" enabled, temperature set to `1.0`.

> [!WARNING]
> **API Constraints:** All models were evaluated via their respective APIs. If a sample fails because the input files are too large, the result counts as a failure. For example, the file input limit for OpenAI models is 50MB, and several comics in this dataset surpass that threshold.

--------

## 📝 Submission Guidelines & Formatting Rules

To ensure fair and accurate evaluation across all participants, submissions are evaluated using automated metrics. Therefore, all model outputs must strictly adhere to the following formatting rules:

* **Source Adherence:** Only provide answers found directly within the document. If the question is unanswerable given the provided image, the response must be exactly: `"Unknown"`.
* **Multiple Answers:** List multiple answers in their order of appearance, separated by a comma and a single space. **Do not** use the word "and". *(Example: `Answer A, Answer B`)*
* **Numbers & Units:** Convert units to their standardized abbreviations (e.g., use `kg` instead of "kilograms", `m` instead of "meters"). Always place a single space between the number and the unit. *(Example: `50 kg`, `10 USD`)*
* **Percentages:** Attach the `%` symbol directly to the number with no space. *(Example: `50%`)*
* **Dates:** Convert all dates to the standardized `YYYY-MM-DD` format. *(Example: "Jan 1st 24" becomes `2024-01-01`)*
* **Decimals:** Use a single period (`.`) as a decimal separator, never a comma. *(Example: `3.14`)*
* **Thousands Separator:** Do not use commas to separate large numbers. *(Example: `1000`, not `1,000`)*
* **No Filler Text:** Output **only** the requested data. Do not frame your answer in full sentences (e.g., avoid "The answer is...").

**Final Output Format:** When generating the final extracted data, your system must prefix the response with the following exact phrasing:

```text
FINAL ANSWER: [Your formatted answer]
```

---------

## Evaluation Code & Baselines

To ensure consistency and fairness, all submissions are evaluated using our official automated evaluation pipeline. This pipeline handles the extraction of your model's answers and applies both strict formatting checks (for numbers, dates, and units) and relaxed text matching (ANLS) for text-based answers.

You can find the complete, ready-to-use evaluation script in our official GitHub repository:

🖥️ **[VLR-CVC/DocVQA2026 GitHub Repository](https://github.com/VLR-CVC/DocVQA2026)**

### What you will find in the repository:

* **The Evaluator Script:** The core logic used to parse your model's outputs and calculate the final scores. You can use this script to test and evaluate your predictions locally before making an official submission.
* **The Baseline Master Prompt:** We have included the exact prompt structure (`get_evaluation_prompt()`) used for our baseline experiments. This prompt is heavily engineered to enforce the competition's mandatory reasoning protocols and strict output formatting.

We highly recommend reviewing both the evaluation script and the Master Prompt. You are welcome to use the provided prompt out-of-the-box or adapt it to better guide your own custom models!

## Dataset Structure

The dataset consists of:

1. **Images:** High-resolution PNG renders of document pages located in the `images/` directory.
2. **Annotations:** A Parquet file (`val.parquet`) containing the questions, answers, and references to the image paths.
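
For local experimentation, the two evaluation steps described under Evaluation Code & Baselines (pulling the answer out of a model output, then relaxed ANLS matching for text answers) can be approximated as below. This is a minimal sketch and not the official evaluator: the function names are hypothetical, the `0.5` threshold is the value conventionally used for ANLS and is only assumed to apply here, and the lowercasing and strict-format handling in the official script may differ. Use the script in the GitHub repository for any score you intend to report.

```python
FINAL_PREFIX = "FINAL ANSWER:"  # prefix required by the submission guidelines


def extract_final_answer(output: str) -> str:
    """Return the first line after the last 'FINAL ANSWER:' marker ('' if absent)."""
    idx = output.rfind(FINAL_PREFIX)
    if idx == -1:
        return ""
    tail = output[idx + len(FINAL_PREFIX):].strip()
    return tail.splitlines()[0].strip() if tail else ""


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls_score(pred: str, gold: str, threshold: float = 0.5) -> float:
    """ANLS-style similarity: 1 - normalized distance, or 0 past the threshold.

    Assumption: case-insensitive comparison and threshold 0.5, as in the
    conventional ANLS definition; the official evaluator may normalize differently.
    """
    p, g = pred.strip().lower(), gold.strip().lower()
    if not p and not g:
        return 1.0
    nl = levenshtein(p, g) / max(len(p), len(g))
    return 1.0 - nl if nl < threshold else 0.0


if __name__ == "__main__":
    raw = "The legend marks this road type.\nFINAL ANSWER: Macadam & Gravel"
    answer = extract_final_answer(raw)
    print(answer, anls_score(answer, "Macadam & Gravel"))
```

Taking the text after the last occurrence of the prefix is a design choice that tolerates models which mention the phrase while reasoning before committing to an answer.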
## Contact

For questions, technical support, or inquiries regarding the DocVQA 2026 dataset and competition framework: **docvqa@cvc.uab.cat**

For participation, leaderboard, and submissions please use the **RRC platform**: https://rrc.cvc.uab.es/?ch=34


Provider:

maas

Created:

2026-02-25
