iVQA (Instructional Video Question Answering)

Name: iVQA (Instructional Video Question Answering)
Creator: OpenDataLab
Published: 2026-05-24 11:30:31
License: No description available

OpenDataLab | Updated: 2026-05-24 | Indexed: 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/iVQA

Resource description:

Recent visual question answering (VQA) approaches rely on large-scale annotated datasets. However, manually annotating questions and answers for videos is tedious, costly, and impedes scalability. In this work, we propose to bypass manual annotation and leverage automatic cross-modal supervision to generate a large-scale training dataset for video question answering (VideoQA). We use a question-generation Transformer trained on text data to produce question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically construct the HowToVQA69M dataset, which contains 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multimodal Transformer and an answer Transformer. We introduce the zero-shot VideoQA task and demonstrate excellent results, in particular for rare answers. Furthermore, we show that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA, and How2QA. Finally, for detailed evaluation, we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality redundant manual annotations.
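
The contrastive training idea mentioned in the description can be illustrated with a minimal sketch. It assumes dot-product similarity and in-batch negatives, where the i-th video-question embedding is paired with the i-th answer embedding. The function name `contrastive_loss` and the toy vectors are hypothetical and are not taken from the authors' code; the actual model produces these embeddings with Transformers, which are not shown here.

```python
import math

def contrastive_loss(vq_embs, ans_embs):
    """Mean softmax cross-entropy; row i of vq_embs is paired with row i of ans_embs."""
    total = 0.0
    for i, vq in enumerate(vq_embs):
        # Dot-product similarity between this video-question embedding and every answer.
        scores = [sum(a * b for a, b in zip(vq, ans)) for ans in ans_embs]
        # Numerically stable log-sum-exp over all candidate answers.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the paired (correct) answer.
        total += log_z - scores[i]
    return total / len(vq_embs)

# Toy 2-D embeddings (assumed values, for illustration only).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[5.0, 0.0], [0.0, 5.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 5.0], [5.0, 0.0]])
```

In this toy setup the loss is small when each video-question embedding points toward its own answer (`aligned`) and large when the pairing is swapped (`shuffled`). Because answers are scored by embedding similarity rather than by a fixed classifier head, the same scheme can in principle rank answers that were rare or unseen during training.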

Provided by:

OpenDataLab

Created:

2022-09-01

Collected Summary

Dataset Introduction