
next-tat/MMDocBench

Source: Hugging Face | Updated: 2024-10-30 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/next-tat/MMDocBench
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- question-answering
- visual-question-answering
- table-question-answering
language:
- en
pretty_name: MMDocBench
size_categories:
- 1K<n<10K
tags:
- LVLMs
- Document-Understanding
---

# MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

**MMDocBench** is an open-source benchmark of OCR-free document understanding tasks for evaluating fine-grained visual perception and reasoning abilities.

For more details, please refer to the project page: https://MMDocBench.github.io/.

<!-- summary, dataset structure, data fields, how to download, citation, licence -->

## Dataset Structure

<!-- ### Dataset Description -->

MMDocBench consists of 15 main tasks and 48 sub-tasks, involving 2,400 document images, 4,338 QA pairs and 11,353 supporting regions (i.e., bounding boxes). The breakdown is described below:

| Main Task | Sub Task | Document Image Type | # Images | # QA Pairs | # Regions |
|:---:|:---:|:---:|:---:|:---:|:---:|
| | | **Fine-Grained Visual Perception** | | | |
| Text<br />Recognition | TextOCR<br />BookOCR | Scene-Text Images<br />Book Covers | 100<br />100 | 100<br />100 | 100<br />438 |
| Table<br />Recognition | FinTabNet<br />PubTables-1M | Financial Reports<br />Scientific Papers | 100<br />100 | 100<br />100 | 1,864<br />3,520 |
| Text<br />Localization | Text2Bbox<br />Bbox2Text | Industry Documents<br />Industry Documents | 100<br />100 | 100<br />100 | 100<br />100 |
| Table Cell<br />Localization | FinTabNet<br />PubTables-1M | Financial Reports<br />Scientific Papers | 100<br />100 | 100<br />100 | 100<br />100 |
| Key<br />Information<br />Extraction | SROIE<br />WildReceipt<br />CORD | Receipts<br />Receipts<br />Receipts | 100<br />100<br />100 | 303<br />512<br />372 | 303<br />512<br />372 |
| Doc Forgery<br />Detection | T-SROIE<br />DocTamper | Receipts<br />Cross-Domain Documents | 100<br />100 | 100<br />100 | 286<br />129 |
| Document<br />QA | DocVQA<br />WTQ<br />TAT-DQA | Industry Documents<br />Wikipedia Tables<br />Financial Reports | 100<br />100<br />100 | 262<br />351<br />214 | 262<br />351<br />214 |
| Chart<br />QA | ChartQA<br />CharXiv | Cross-Domain Charts<br />Scientific Charts | 100<br />100 | 104<br />149 | 104<br />149 |
| Infographic<br />QA | InfographicVQA | Infographics | 100 | 281 | 281 |
| | | **Fine-Grained Visual Reasoning** | | | |
| <br />Arithmetic<br />Reasoning | DUDE<br />WTQ<br />TAT-DQA<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Financial Table-Text Documents<br />Scientific Charts<br />Infographics | 13<br />54<br />98<br />23<br />34 | 15<br />55<br />217<br />23<br />53 | 34<br />159<br />453<br />67<br />90 |
| <br />Logical<br />Reasoning | DUDE<br />WTQ<br />TAT-DQA<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Financial Table-Text Documents<br />Scientific Charts<br />Infographics | 10<br />11<br />1<br />7<br />2 | 11<br />11<br />1<br />7<br />2 | 20<br />41<br />2<br />12<br />3 |
| <br />Spatial<br />Reasoning | DUDE<br />WTQ<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Scientific Charts<br />Infographics | 38<br />4<br />7<br />17 | 41<br />4<br />7<br />23 | 43<br />8<br />12<br />54 |
| <br />Comparison | DUDE<br />WTQ<br />TAT-DQA<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Financial Table-Text Documents<br />Scientific Charts<br />Infographics | 3<br />33<br />10<br />16<br />13 | 3<br />34<br />10<br />16<br />15 | 6<br />74<br />30<br />44<br />44 |
| <br />Sorting | DUDE<br />WTQ<br />TAT-DQA<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Financial Table-Text Documents<br />Scientific Charts<br />Infographics | 3<br />6<br />7<br />15<br />20 | 3<br />12<br />7<br />15<br />29 | 6<br />23<br />14<br />29<br />57 |
| <br />Counting | DUDE<br />WTQ<br />TAT-DQA<br />CharXiv<br />InfographicVQA | General Documents<br />Wikipedia Tables<br />Financial Table-Text Documents<br />Scientific Charts<br />Infographics | 51<br />15<br />14<br />38<br />44 | 55<br />15<br />14<br />40<br />52 | 244<br />76<br />26<br />149<br />248 |

## Data Fields

- **index:** The id of the data instance.
- **image:** The image associated with the instance, encoded in base64.
- **raw_question:** The base question.
- **question:** The base question embedded in an instruction that specifies requirements such as formatting and normalization.
- **answer:** The ground truth in JSON format, containing text and bounding boxes.
- **task:** The main task of the data instance, such as `Text Recognition`, `Text Localization` and `Document Question Answering`.
- **sub_task:** The sub-task of the data instance, which normally refers to the source dataset.
- **capability:** The top-level task category of the data instance, which is either `Visual Perception` or `Visual Reasoning`.

## How to use

You can download the dataset to a local directory as follows:

```bash
git clone https://huggingface.co/datasets/next-tat/MMDocBench/
```

## Citation

```
@misc{zhu2024mmdocbench,
  title={MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding},
  author={Fengbin Zhu and Ziyang Liu and Xiang Yao Ng and Haohui Wu and Wenjie Wang and Fuli Feng and Chao Wang and Huanbo Luan and Tat Seng Chua},
  year={2024},
  eprint={2410.21311},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.21311},
}
```

## Licence

The benchmark is distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en) license.
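
## Example: decoding a record

The data fields described above state that `image` is base64-encoded and that `answer` is JSON holding text and bounding boxes. The following is a minimal sketch of how one record might be decoded under those assumptions. The helper name `decode_record`, the synthetic `sample` record, and the keys inside the answer JSON (`answer`, `bbox`) are hypothetical and chosen only for illustration; the dataset's actual answer schema and file format are not specified here and should be checked against the downloaded files.

```python
import base64
import json


def decode_record(record: dict) -> dict:
    """Decode the base64 image and parse the JSON answer of one record.

    Hypothetical helper: the field names follow the Data Fields list,
    but the storage format of the real dataset files is an assumption.
    """
    image_bytes = base64.b64decode(record["image"])
    answer = record["answer"]
    # The answer may arrive as a JSON string or already parsed; handle both.
    if isinstance(answer, str):
        answer = json.loads(answer)
    return {
        "index": record["index"],
        "question": record["question"],
        "image_bytes": image_bytes,
        "answer": answer,
    }


# Synthetic record for illustration only. The keys inside the answer JSON
# ("answer", "bbox") are hypothetical, not the benchmark's documented schema.
sample = {
    "index": 0,
    "image": base64.b64encode(b"fake-image-bytes").decode("ascii"),
    "raw_question": "What is the title?",
    "question": "What is the title? Reply in JSON with text and bounding box.",
    "answer": json.dumps({"answer": "Example", "bbox": [10, 20, 110, 60]}),
    "task": "Text Recognition",
    "sub_task": "TextOCR",
    "capability": "Visual Perception",
}

decoded = decode_record(sample)
print(len(decoded["image_bytes"]), decoded["answer"])
```

The decoded bytes can then be passed to any image library for viewing or for overlaying the ground-truth regions.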
Provider:
next-tat