CiteVQA

Name: CiteVQA
Creator: maas
Published: 2026-05-16 15:29:39
License: No description provided

ModelScope community: updated 2026-05-16, indexed 2026-05-17

Download link:

https://modelscope.cn/datasets/OpenDataLab/CiteVQA


# CiteVQA

[English](./README.md) | [简体中文](./README_ZH.md)

**CiteVQA** is a document visual question answering benchmark for **faithful evidence attribution**. Unlike conventional DocVQA datasets that only score the final answer, CiteVQA requires a model to answer a question with evidence grounded in the source document at the **element level**. The benchmark is designed to evaluate whether a system can not only answer correctly, but also cite the right supporting region in long, real-world PDFs.

The dataset contains **1,897 questions** built from **711 PDFs** across **7 macro-domains** and **30 sub-domains**, with an average of **40.6 pages per document**. It covers both **English** and **Chinese** documents, and includes **single-document** as well as **multi-document** settings.

<img src="./readme_img/citevqa_example.png" width="100%">

## Highlights

- **Joint answer-and-evidence evaluation**: CiteVQA is built for evaluating both answer correctness and citation faithfulness.
- **Element-level evidence**: Ground-truth evidence is provided as structured elements with bounding boxes, page indices, and document indices.
- **Long-document setting**: Documents are multi-page PDFs with realistic length and layout complexity.
- **Cross-domain and bilingual**: The benchmark spans **7 domains**, **30 sub-domains**, and two languages (`en`, `zh`).
- **Multi-document reasoning**: In addition to single-document QA, the dataset includes cross-document questions requiring evidence aggregation.

## Pipeline

CiteVQA is built with an automated pipeline that links documents, extracts evidence packages, synthesizes question-answer pairs, and validates crucial evidence for attribution-aware evaluation.

<img src="./readme_img/citevqa_pipeline.png" width="100%">

## Dataset Summary

- **Documents**: 711
- **Questions**: 1,897
- **Languages**: 938 English, 959 Chinese
- **Average / median pages per document**: 40.6 / 30.0
- **Dataset settings**:
  - `Single-Doc`: 987
  - `Multi (1-Gold)`: 487
  - `Multi (N-Gold)`: 423
- **Question types**:
  - `Complex Synthesis`: 839
  - `Factual Retrieval`: 499
  - `Multimodal Parsing`: 352
  - `Quantitative Reasoning`: 207
- **Average evidence elements per question**: 2.57
- **Maximum evidence elements per question**: 10

## What Is in the Dataset

The main annotation file is:

- [CiteVQA.json](./data/validation/CiteVQA.json): the benchmark annotations

Each sample contains:

- `index`: unique sample id
- `Question`: the user question
- `Standard_Answer`: gold answer
- `Question_Type`: one of `Complex Synthesis`, `Factual Retrieval`, `Multimodal Parsing`, or `Quantitative Reasoning`
- `dataset_type`: one of `Single-Doc`, `Multi (1-Gold)`, or `Multi (N-Gold)`
- `language`: `en` or `zh`
- `description`: domain / sub-domain description
- `PDF_Source`: list of source PDF paths
- `Evidence`: list of evidence elements

### Evidence Format

Each evidence element uses a unified structure:

```json
{
  "type": "equation",
  "content": "\\[\n\\Phi (\\kappa) = \\frac {4 \\pi \\left\\langle \\delta n _ {\\mathrm {R I}} ^ {2} \\right\\rangle L _ {0} ^ {2} (\\zeta - 1)}{\\left(1 + \\kappa^ {2} L _ {0} ^ {2}\\right) ^ {\\zeta}}, \\tag {7}\n\\]",
  "bbox": [649, 390, 683, 912],
  "angle": 0,
  "necessity": "necessary",
  "source_pdf_name": "e5be571f178039fee84e79edbd3ca66c7789348e57b7efa87c03fa91901923f2.pdf",
  "source_page_id": 2,
  "source_doc_index": 1
}
```

Field meanings:

- `type`: evidence type, such as `text`, `title`, `table`, `image`, or `equation`
- `content`: textual/structural content of the evidence; for tables, this may be HTML-like serialized table content; for images, it can be `null`
- `bbox`: bounding box in the source page
- `angle`: rotation angle
- `necessity`: whether the element is marked as `necessary` or `non_necessary`
- `source_pdf_name`: source PDF filename
- `source_page_id`: 0-based or dataset-defined page index in the source PDF
- `source_doc_index`: index of the source document within `PDF_Source`

<details>
<summary>Observed evidence element types</summary>

- `text`
- `title`
- `table`
- `image`
- `image_caption`
- `table_caption`
- `equation`
- `header`
- `footer`
- `list`
- `ref_text`
- `page_footnote`
- `table_footnote`
- `image_footnote`
- `code`
- `page_number`
- `aside_text`

</details>

## Example Sample

<details>
<summary>Show example sample</summary>

```json
{
  "index": "ffb14537-fb4c-5aa4-b363-d8191f9bd61a_0",
  "Question_Type": "Multimodal Parsing",
  "Standard_Answer": "below",
  "Question": "On page 39, is the sentence specifying that the deal stays in effect until March 31, 2019, positioned above or below the section header for Article 32?",
  "Evidence": [
    {
      "type": "text",
      "content": "32.01 This agreement shall be binding and continue in force and effect until the 31st day of March, 2019. (Amended, 2010, 2013, 2016)",
      "bbox": [465, 135, 501, 881],
      "angle": 0,
      "necessity": "necessary",
      "source_pdf_name": "ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf",
      "source_page_id": 39,
      "source_doc_index": 1
    },
    {
      "type": "title",
      "content": "ARTICLE 32 TERM OF AGREEMENT, NOTICE TO BARGAIN AND RETROACTIVITY",
      "bbox": [431, 135, 448, 831],
      "angle": 0,
      "necessity": "necessary",
      "source_pdf_name": "ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf",
      "source_page_id": 39,
      "source_doc_index": 1
    }
  ],
  "dataset_type": "Single-Doc",
  "description": "Laws & Regulations, Gov & Legal",
  "language": "en",
  "PDF_Source": [
    "data/pdf/ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf"
  ]
}
```

</details>

## Download the PDFs

The annotation file stores the referenced PDF paths, while the actual PDFs can be downloaded with the provided script and source table.

Files:

- [data/download/download_pdfs.py](./data/download/download_pdfs.py)
- [data/download/pdf_source.csv](./data/download/pdf_source.csv)

From the repository root, run:

```bash
python data/download/download_pdfs.py --workers 16 --out data/pdf --csv data/download/pdf_source.csv
```

This will download the PDFs into `data/pdf/`, matching the paths used in `PDF_Source`.

## Usage

Load the JSON file:

```python
import json

with open("./data/validation/CiteVQA.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(len(data))
print(data[0].keys())
```

Basic iteration:

```python
sample = data[0]
question = sample["Question"]
answer = sample["Standard_Answer"]
pdfs = sample["PDF_Source"]
evidence = sample["Evidence"]
```

## 🏆 Evaluation Result

We evaluated 20 state-of-the-art MLLMs on CiteVQA using a unified prompt template. The results show that faithful evidence attribution remains substantially harder than answer-only scoring.

- **Best overall SAA**: `Gemini-3.1-Pro-Preview` reaches **76.0** SAA with **86.1** answer score.
- **Best answer accuracy**: `GPT-5.4` reaches **87.1** answer score, but its SAA drops to **59.0**.
- **Best open-source model**: `Qwen3-VL-235B-A22B` reaches **22.5** SAA with **72.3** answer score.
- **Key finding**: a large gap between `Ans.` and `SAA` appears across models, highlighting the benchmark's `Attribution Hallucination` challenge.

Full overall results:

| Model | Category | Rec. | Rel. | Ans. | SAA |
| --- | --- | ---: | ---: | ---: | ---: |
| Gemini-3.1-Pro-Preview | Closed-source MLLMs | 66.0 | 83.6 | 86.1 | 76.0 |
| Gemini-3-Flash-Preview | Closed-source MLLMs | 45.4 | 75.7 | 84.5 | 65.4 |
| GPT-5.4 | Closed-source MLLMs | 31.0 | 67.5 | 87.1 | 59.0 |
| Gemini-2.5-Pro | Closed-source MLLMs | 27.4 | 59.8 | 82.2 | 47.0 |
| Seed2.0-Pro | Closed-source MLLMs | 28.5 | 54.9 | 81.3 | 44.1 |
| GPT-5.2 | Closed-source MLLMs | 18.2 | 56.6 | 71.5 | 33.7 |
| Qwen3.6-Plus | Closed-source MLLMs | 7.7 | 25.0 | 85.9 | 17.5 |
| GLM-5V-Turbo | Closed-source MLLMs | 14.9 | 29.2 | 49.6 | 12.8 |
| Qwen3-VL-235B-A22B | Open-source Large MLLMs | 11.3 | 35.3 | 72.3 | 22.5 |
| Gemma-4-31B | Open-source Large MLLMs | 11.6 | 35.0 | 69.8 | 20.2 |
| Kimi-K2.5 | Open-source Large MLLMs | 6.2 | 26.8 | 74.3 | 19.1 |
| Qwen3.5-397B-A17B | Open-source Large MLLMs | 5.4 | 24.6 | 76.5 | 18.3 |
| Qwen3.5-27B | Open-source Large MLLMs | 5.3 | 25.3 | 75.6 | 17.3 |
| Qwen3-VL-32B | Open-source Large MLLMs | 6.6 | 30.5 | 72.3 | 17.3 |
| Qwen3.5-122B-A10B | Open-source Large MLLMs | 3.9 | 19.0 | 73.6 | 14.8 |
| Qwen3.5-9B | Open-source Small MLLMs | 1.6 | 14.7 | 65.0 | 11.1 |
| Qwen3.5-35B-A3B | Open-source Small MLLMs | 1.7 | 13.7 | 76.4 | 10.7 |
| Qwen3-VL-30B-A3B | Open-source Small MLLMs | 3.5 | 14.6 | 62.2 | 8.2 |
| Qwen3-VL-8B | Open-source Small MLLMs | 1.0 | 14.7 | 61.2 | 7.5 |
| Gemma-4-26B-A4B | Open-source Small MLLMs | 3.0 | 17.9 | 48.4 | 6.2 |

## Evaluation Code

Evaluation code and benchmark updates are available in the official repository:

- [https://github.com/opendatalab/CiteVQA](https://github.com/opendatalab/CiteVQA)

## Copyright Notice

The PDF sources in CiteVQA are collected from publicly accessible web resources, primarily via Common Crawl. To respect copyright and redistribution constraints, this project releases structured annotations, metadata, and public download links, rather than redistributing protected PDF contents directly.

CiteVQA is provided for academic research and non-commercial use only. We fully respect the rights of original copyright holders. If any rights holder believes that the inclusion, indexing, or use of any relevant content in this benchmark is inappropriate, please contact `OpenDataLab@pjlab.org.cn`. We will verify the request and remove or update the relevant content when appropriate.

## Citation

```bibtex
@article{ma2026citevqa,
  title={CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence},
  author={Ma, Dongsheng and Li, Jiayu and Wang, Zhengren and Wang, Yijie and Kong, Jiahao and Zeng, Weijun and Xiao, Jutao and Yang, Jie and Zhang, Wentao and Wang, Bin and He, Conghui},
  journal={arXiv preprint arXiv:2605.12882},
  year={2026}
}
```
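
## Example: Grouping Evidence by Page

As a supplement to the Evidence Format and Usage sections, the following is a minimal sketch of how the `necessary` evidence elements of one sample might be grouped by source PDF and page, for instance before rendering pages and cropping `bbox` regions. The helper name `group_necessary_evidence` is hypothetical and not part of the official repository. The sketch matches each element to a path in `PDF_Source` by filename, because the indexing base of `source_doc_index` is not specified here (the example sample shows the value `1` alongside a one-entry `PDF_Source`); that matching strategy is an assumption, not the authors' method.

```python
import os
from collections import defaultdict


def group_necessary_evidence(sample):
    """Group a sample's necessary evidence elements by (PDF path, page id)."""
    # Map each PDF filename to its path as listed in PDF_Source.
    path_by_name = {os.path.basename(p): p for p in sample["PDF_Source"]}
    grouped = defaultdict(list)
    for element in sample["Evidence"]:
        # Skip elements not marked as necessary.
        if element.get("necessity") != "necessary":
            continue
        name = element["source_pdf_name"]
        # Fall back to the bare filename if it is not listed in PDF_Source.
        pdf_path = path_by_name.get(name, name)
        key = (pdf_path, element["source_page_id"])
        grouped[key].append({"type": element["type"], "bbox": element["bbox"]})
    return dict(grouped)


# Abbreviated copy of the example sample shown in this README.
sample = {
    "PDF_Source": ["data/pdf/ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf"],
    "Evidence": [
        {
            "type": "text",
            "bbox": [465, 135, 501, 881],
            "necessity": "necessary",
            "source_pdf_name": "ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf",
            "source_page_id": 39,
        },
        {
            "type": "title",
            "bbox": [431, 135, 448, 831],
            "necessity": "necessary",
            "source_pdf_name": "ffb14537-fb4c-5aa4-b363-d8191f9bd61a.pdf",
            "source_page_id": 39,
        },
    ],
}

for (pdf_path, page_id), elements in group_necessary_evidence(sample).items():
    print(pdf_path, page_id, [e["type"] for e in elements])
```

With the full annotation file loaded as `data`, the same helper could be applied to each entry in turn. The coordinate system of `bbox` is not documented in this README, so any cropping step should be checked against the official evaluation code.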


Provider: maas
Created: 2026-05-12
