next-tat/MMDocBench

Name: next-tat/MMDocBench
Creator: next-tat
Published: 2024-10-30 03:38:21
License: No description provided

Source: Hugging Face. Updated 2024-10-30. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/next-tat/MMDocBench


Resource description:

---
license: cc-by-sa-4.0
task_categories:
- question-answering
- visual-question-answering
- table-question-answering
language:
- en
pretty_name: MMDocBench
size_categories:
- 1K<n<10K
tags:
- LVLMs
- Document-Understanding
---

# MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

**MMDocBench** is an open-source benchmark with various OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities.

For more details, please refer to the project page: https://MMDocBench.github.io/.

## Dataset Structure

MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs and 11,353 supporting regions (i.e., bounding boxes). The breakdown is described below:

| Main Task | Sub Task | Document Image Type | # Images | # QA Pairs | # Regions |
|:---:|:---:|:---:|:---:|:---:|:---:|
| | | **Fine-Grained Visual Perception** | | | |
| Text Recognition | TextOCR | Scene-Text Images | 100 | 100 | 100 |
| | BookOCR | Book Covers | 100 | 100 | 438 |
| Table Recognition | FinTabNet | Financial Reports | 100 | 100 | 1,864 |
| | PubTables-1M | Scientific Papers | 100 | 100 | 3,520 |
| Text Localization | Text2Bbox | Industry Documents | 100 | 100 | 100 |
| | Bbox2Text | Industry Documents | 100 | 100 | 100 |
| Table Cell Localization | FinTabNet | Financial Reports | 100 | 100 | 100 |
| | PubTables-1M | Scientific Papers | 100 | 100 | 100 |
| Key Information Extraction | SROIE | Receipts | 100 | 303 | 303 |
| | WildReceipt | Receipts | 100 | 512 | 512 |
| | CORD | Receipts | 100 | 372 | 372 |
| Doc Forgery Detection | T-SROIE | Receipts | 100 | 100 | 286 |
| | DocTamper | Cross-Domain Documents | 100 | 100 | 129 |
| Document QA | DocVQA | Industry Documents | 100 | 262 | 262 |
| | WTQ | Wikipedia Tables | 100 | 351 | 351 |
| | TAT-DQA | Financial Reports | 100 | 214 | 214 |
| Chart QA | ChartQA | Cross-Domain Charts | 100 | 104 | 104 |
| | CharXiv | Scientific Charts | 100 | 149 | 149 |
| Infographic QA | InfographicVQA | Infographics | 100 | 281 | 281 |
| | | **Fine-Grained Visual Reasoning** | | | |
| Arithmetic Reasoning | DUDE | General Documents | 13 | 15 | 34 |
| | WTQ | Wikipedia Tables | 54 | 55 | 159 |
| | TAT-DQA | Financial Table-Text Documents | 98 | 217 | 453 |
| | CharXiv | Scientific Charts | 23 | 23 | 67 |
| | InfographicVQA | Infographics | 34 | 53 | 90 |
| Logical Reasoning | DUDE | General Documents | 10 | 11 | 20 |
| | WTQ | Wikipedia Tables | 11 | 11 | 41 |
| | TAT-DQA | Financial Table-Text Documents | 1 | 1 | 2 |
| | CharXiv | Scientific Charts | 7 | 7 | 12 |
| | InfographicVQA | Infographics | 2 | 2 | 3 |
| Spatial Reasoning | DUDE | General Documents | 38 | 41 | 43 |
| | WTQ | Wikipedia Tables | 4 | 4 | 8 |
| | CharXiv | Scientific Charts | 7 | 7 | 12 |
| | InfographicVQA | Infographics | 17 | 23 | 54 |
| Comparison | DUDE | General Documents | 3 | 3 | 6 |
| | WTQ | Wikipedia Tables | 33 | 34 | 74 |
| | TAT-DQA | Financial Table-Text Documents | 10 | 10 | 30 |
| | CharXiv | Scientific Charts | 16 | 16 | 44 |
| | InfographicVQA | Infographics | 13 | 15 | 44 |
| Sorting | DUDE | General Documents | 3 | 3 | 6 |
| | WTQ | Wikipedia Tables | 6 | 12 | 23 |
| | TAT-DQA | Financial Table-Text Documents | 7 | 7 | 14 |
| | CharXiv | Scientific Charts | 15 | 15 | 29 |
| | InfographicVQA | Infographics | 20 | 29 | 57 |
| Counting | DUDE | General Documents | 51 | 55 | 244 |
| | WTQ | Wikipedia Tables | 15 | 15 | 76 |
| | TAT-DQA | Financial Table-Text Documents | 14 | 14 | 26 |
| | CharXiv | Scientific Charts | 38 | 40 | 149 |
| | InfographicVQA | Infographics | 44 | 52 | 248 |

## Data Fields

- **index:** The ID of the data instance.
- **image:** The image associated with the instance, encoded in base64.
- **raw_question:** The base question.
- **question:** The base question embedded in an instruction that specifies requirements such as formatting and normalization.
- **answer:** The ground truth in JSON format, containing text and bounding boxes.
- **task:** The main task of the data instance, such as `Text Recognition`, `Text Localization` and `Document Question Answering`.
- **sub_task:** The sub-task of the data instance, which normally refers to the source dataset.
- **capability:** The top-level task category of the data instance, which is either `Visual Perception` or `Visual Reasoning`.
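
The fields described above can be unpacked with the Python standard library alone. The following is a minimal sketch, not the benchmark's official loader: it assumes a record is available as a dictionary keyed by the field names listed above and that `answer` is stored as a JSON string. The helper name `decode_record`, the sample record, and the `text`/`bbox` layout inside the sample answer are hypothetical and only illustrate the idea; inspect a real record to confirm the actual answer schema.

```python
import base64
import json


def decode_record(record: dict) -> dict:
    """Return a copy of `record` with `image` as raw bytes and `answer` parsed from JSON.

    Hypothetical helper: assumes `image` is a base64 string and `answer`
    is either a JSON string or an already-parsed object.
    """
    decoded = dict(record)
    # The dataset card states that the image is base64-encoded.
    decoded["image"] = base64.b64decode(record["image"])
    # The dataset card states that the answer is in JSON format.
    answer = record["answer"]
    decoded["answer"] = json.loads(answer) if isinstance(answer, str) else answer
    return decoded


# Made-up record for illustration only; the answer layout is an assumption.
sample = {
    "index": 0,
    "image": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
    "raw_question": "What is the total amount?",
    "question": "What is the total amount? Answer in JSON.",
    "answer": json.dumps([{"text": "42.00", "bbox": [10, 20, 110, 40]}]),
    "task": "Key Information Extraction",
    "sub_task": "SROIE",
    "capability": "Visual Perception",
}

rec = decode_record(sample)
print(len(rec["image"]), rec["answer"])
```

The decoded `image` bytes can then be written to disk or handed to an image library of your choice.
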

## How to use

You can download the dataset to a local directory as follows:

```bash
git clone https://huggingface.co/datasets/next-tat/MMDocBench/
```

## Citation

```
@misc{zhu2024mmdocbench,
  title={MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding},
  author={Fengbin Zhu and Ziyang Liu and Xiang Yao Ng and Haohui Wu and Wenjie Wang and Fuli Feng and Chao Wang and Huanbo Luan and Tat Seng Chua},
  year={2024},
  eprint={2410.21311},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.21311},
}
```

## Licence

The benchmark is distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.

Provided by:

next-tat
