DT-VQA
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/Yuliang-Liu/MultimodalOCR
Resource description:
DT-VQA contains 170,000 question-answer pairs generated from 30,000 images, focusing on dense text in documents, tables, and product descriptions. It is designed to probe how well large multimodal models (LMMs) handle dense-text tasks, and it covers varied image styles, from structured tables to unstructured scene images. The task is visual question answering (VQA) over dense-text images.



