NExT-GQA

Source: arXiv, indexed 2025-09-30

Download link:

https://github.com/doc-doc/next-gqa


Resource overview:

NExT-GQA extends the NExT-QA dataset with 10,500 temporal grounding labels tied to the original question-answer pairs. Its goal is to make video question answering (VideoQA) systems more reliable by providing visual evidence that substantiates each answer. The temporal labels are included specifically for the weakly supervised setting and were annotated by a team of 30 undergraduate students. They cover 8,911 question-answer pairs and 1,557 videos, and the dataset aims to advance visually grounded VideoQA.
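To make the idea of a temporal label tied to a question-answer pair concrete, here is a minimal sketch in Python. The record layout, the field names, and the sample values are hypothetical and are not taken from the repository's actual file format. Scoring a predicted segment by temporal intersection over union (IoU) is also an assumption here: it is a common way to compare time spans, but this page does not say which metric the dataset uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedQA:
    """One QA pair with a temporal label (hypothetical schema, not the repository's)."""
    video_id: str
    question: str
    answer: str
    start_s: float  # start of the evidence segment, in seconds
    end_s: float    # end of the evidence segment, in seconds


def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection over union of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


# Made-up record for illustration only; it is not an entry from the dataset.
sample = GroundedQA("video_0001", "Why did the boy pick up the ball?", "to throw it", 4.0, 10.0)

# A model that predicts the evidence lies in 6.0-12.0 s overlaps the label by 4 s out of 8 s.
score = temporal_iou((6.0, 12.0), (sample.start_s, sample.end_s))
print(f"temporal IoU = {score:.2f}")  # temporal IoU = 0.50
```

Under this sketch, a model could pick the right answer and still score poorly on the segment, which is the kind of gap between answering and grounding that the dataset is meant to expose.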


Background overview

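NExT-GQA is a dataset for visually grounded video question answering (VideoQA). It requires vision-language models to localize the relevant video moments as visual evidence while answering a question, which exposes cases where a model relies on shortcut learning rather than faithful multimodal reasoning. The dataset is designed to support research on more interpretable and trustworthy multimodal techniques, and its strict requirement for visual evidence is meant to test whether a model genuinely understands the video.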

The content above was collected and summarized by 遇见数据集.
