DT-VQA

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/Yuliang-Liu/MultimodalOCR


Resource summary:

DT-VQA contains 170,000 question-answer pairs generated from 30,000 images, focusing on dense text in documents, tables, and product descriptions. It is designed to probe how well large multimodal models (LMMs) handle dense-text tasks, and it covers diverse image styles, from structured tables to unstructured scene images. The core task is visual question answering (VQA) on dense-text images.
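Because each image yields several question-answer pairs, it can help to think of the data as QA pairs grouped by source image. The sketch below illustrates that relationship under stated assumptions: the `QAPair` record and its field names (`image_id`, `question`, `answer`) are hypothetical and not taken from the repository, whose actual annotation format may differ. The only figures it uses are the reported totals, which work out to roughly 5.67 QA pairs per image on average.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout for illustration only; the real DT-VQA
# annotation files may use different field names or structure.
@dataclass(frozen=True)
class QAPair:
    image_id: str
    question: str
    answer: str


def group_by_image(pairs):
    """Group QA pairs by the image they were generated from."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair.image_id].append(pair)
    return dict(groups)


def mean_pairs_per_image(num_pairs, num_images):
    """Average number of QA pairs generated per image."""
    return num_pairs / num_images


if __name__ == "__main__":
    # Reported scale: 170,000 QA pairs from 30,000 images.
    print(round(mean_pairs_per_image(170_000, 30_000), 2))  # 5.67

    # Toy records (invented placeholders, not real dataset entries).
    toy = [
        QAPair("img_a", "placeholder question 1", "placeholder answer 1"),
        QAPair("img_a", "placeholder question 2", "placeholder answer 2"),
        QAPair("img_b", "placeholder question 3", "placeholder answer 3"),
    ]
    for image_id, qas in group_by_image(toy).items():
        print(image_id, len(qas))
```

Grouping by image rather than treating pairs independently makes it easy to keep all questions about one image in the same split when partitioning the data.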
