HRVideoBench

Name: HRVideoBench
Creator: maas
Published: 2025-11-12 16:22:02
License: No description available

ModelScope community. Updated: 2025-11-12. Indexed: 2025-02-08.

Download link:

https://modelscope.cn/datasets/TIGER-Lab/HRVideoBench


Description:

# HRVideoBench

This repo contains the test data for **HRVideoBench**, which is released under the paper "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation". [VISTA](https://huggingface.co/papers/2412.00927) is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

[**🌐 Homepage**](https://tiger-ai-lab.github.io/VISTA/) | [**📖 arXiv**](https://arxiv.org/abs/2412.00927) | [**💻 GitHub**](https://github.com/TIGER-AI-Lab/VISTA) | [**🤗 VISTA-400K**](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K) | [**🤗 Models**](https://huggingface.co/collections/TIGER-Lab/vista-674a2f0fab81be728a673193) | [**🤗 HRVideoBench**](https://huggingface.co/datasets/TIGER-Lab/HRVideoBench)

## HRVideoBench Overview

We observe that existing video understanding benchmarks are inadequate for accurately assessing the ability of video LMMs to understand high-resolution videos, especially the details inside the videos. Prior benchmarks mainly consist of low-resolution videos. More recent benchmarks focus on evaluating the long video understanding capability of video LMMs, which contain questions that typically pertain to a short segment in the long video. As a result, a model's high-resolution video understanding performance can be undermined if it struggles to sample or retrieve the relevant frames from a lengthy video sequence.

To address this gap, we introduce HRVideoBench, a comprehensive benchmark with 200 multiple-choice questions designed to assess video LMMs for high-resolution video understanding. HRVideoBench focuses on the perception and understanding of small regions and subtle actions in the video. Our test videos are at least 1080p and contain 10 different video types collected with real-world applications in mind. For example, key applications of high-resolution video understanding include autonomous driving and video surveillance. We correspondingly collect POV driving videos and CCTV footage for the benchmark.

Our benchmark consists of 10 types of questions, all of which are manually annotated and can be broadly categorized into object and action-related tasks. Examples of HRVideoBench questions are shown in the figure below.

<p align="center"> <img src="https://tiger-ai-lab.github.io/VISTA/static/images/hrvideobench_examples.png" width="900"> </p>

## Usage

We release the original video (under the folder `videos`) and the extracted JPEG video frames (`frames.zip`) in this repo. To access the 200 test questions, please refer to `hrvideobench.jsonl`.

## Citation

If you find our paper useful, please cite us with

```
@misc{ren2024vistaenhancinglongdurationhighresolution,
  title={VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation},
  author={Weiming Ren and Huan Yang and Jie Min and Cong Wei and Wenhu Chen},
  year={2024},
  eprint={2412.00927},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.00927},
}
```
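As a minimal sketch of how the test questions mentioned in the Usage section might be read: the snippet below assumes `hrvideobench.jsonl` follows the usual JSON Lines convention (one JSON object per line) and has already been downloaded to the working directory. The field names inside each record are not documented here, so `field_counts` only reports which keys actually appear instead of assuming any. The helper names `load_jsonl`, `field_counts`, and `accuracy` are illustrative and are not part of any official VISTA tooling.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_counts(records):
    """Count how often each top-level key appears, to reveal the schema."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def accuracy(predicted, expected):
    """Fraction of positions where the predicted choice equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    path = Path("hrvideobench.jsonl")
    if path.exists():
        questions = load_jsonl(path)
        print(len(questions), "questions loaded")
        print(dict(field_counts(questions)))
```

Once the key listing shows which field holds the ground-truth option, `accuracy` can compare a model's chosen options against the values of that field.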


Provider:

maas

Created:

2025-02-03
