VideoVista: Benchmarking Diverse and Complex Video-Language Interaction for MLLMs
Source: IEEE · Indexed: 2026-04-17
Download link:
https://ieee-dataport.org/documents/videovista-benchmarking-diverse-and-complex-video-language-interaction-mllms
Description:
We introduce VideoVista, a comprehensive benchmark for evaluating the diverse and complex video-language interaction capabilities of Video-LLMs. We also propose an automated data generation framework to streamline the development of advanced Video-LLMs and make human annotation within the community more efficient. Specifically, we propose a structured task taxonomy to guide the development of VideoVista: 1) To assess the comprehensive capabilities of models, we collect 2,619 videos spanning over 154 domains from diverse platforms (e.g., YouTube, Bilibili, Xiaohongshu), covering content such as Science and Technology, Sports, and Entertainment. 2) To evaluate model robustness across temporal scales, the dataset includes videos ranging from one minute to over two hours in duration, challenging models in both short- and long-term video processing. 3) We introduce 8 major task categories encompassing 48 subtask types, designed to probe a wide spectrum of abilities, including content understanding and prediction at the object, event, and whole-video levels, English and Chinese cultural contexts, spatial and temporal reasoning, streaming question answering, and others.
Provided by:
Yunxin Li



