ShotQA

Name: ShotQA
Creator: maas
Published: 2025-12-05 11:57:35
License: No description provided

ModelScope Community, updated 2025-12-05, indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/Vchitect/ShotQA


Description:

# ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

This is the official dataset of **ShotQA**, the first large-scale training dataset designed for comprehensive cinematography understanding. It contains approximately 70k QA pairs, each consisting of an image or video clip, a cinematography-related question, and four multiple-choice options with one correct answer.

- **Paper:** [ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models](https://arxiv.org/abs/2506.21356)
- **Project Page:** [https://vchitect.github.io/ShotBench-project/](https://vchitect.github.io/ShotBench-project/)
- **Code:** [https://github.com/Vchitect/ShotBench](https://github.com/Vchitect/ShotBench)

## Overview

We introduce **ShotBench**, a comprehensive benchmark for evaluating VLMs' understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.

We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.

To address the identified limitations and facilitate future research, we constructed **ShotQA**, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed **ShotVL**, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).
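As a rough illustration of the item format and of the two places where answer correctness matters (average-accuracy evaluation and a GRPO reward), here is a minimal Python sketch. It is not taken from the official ShotBench code or the ShotQA files. It assumes a hypothetical record layout (`MCQItem` with `media`, `question`, `options`, `answer`, `dimension`), a hypothetical parser that only accepts a reply starting with an option letter, a binary correct/incorrect reward, and an unweighted mean over dimensions as the "average accuracy". The advantage function follows the common GRPO formulation, in which each sampled response's reward is normalized by the mean and standard deviation of the group of responses sampled for the same question.

```python
from __future__ import annotations

import re
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev

LETTERS = ("A", "B", "C", "D")

# Hypothetical parser: accepts a reply that *starts* with an option letter followed by
# ")", ".", ":" or end of text, e.g. "C", "(C)", "C) Medium shot". A reply such as
# "A close-up shot" is not treated as choosing option A.
_CHOICE_RE = re.compile(r"\s*\(?([A-D])(?:[).:]|\s*$)")


@dataclass(frozen=True)
class MCQItem:
    """One four-option multiple-choice item (hypothetical layout, not the official ShotQA schema)."""

    media: str                # path to the image or video clip
    question: str             # cinematography-related question
    options: tuple[str, ...]  # exactly four answer options
    answer: str               # letter of the single correct option
    dimension: str            # cinematography dimension the question probes

    def __post_init__(self) -> None:
        if len(self.options) != len(LETTERS):
            raise ValueError("expected exactly four options")
        if self.answer not in LETTERS:
            raise ValueError(f"answer must be one of {LETTERS}")


def extract_choice(response: str) -> str | None:
    """Return the option letter a reply starts with, or None if it cannot be parsed."""
    match = _CHOICE_RE.match(response)
    return match.group(1) if match else None


def mcq_reward(item: MCQItem, response: str) -> float:
    """Binary reward (assumption): 1.0 for the correct letter, 0.0 otherwise, including unparsable replies."""
    return 1.0 if extract_choice(response) == item.answer else 0.0


def accuracy_by_dimension(
    items: list[MCQItem], responses: list[str]
) -> tuple[dict[str, float], float]:
    """Accuracy per dimension plus their unweighted mean (the macro average is an assumption)."""
    if not items or len(items) != len(responses):
        raise ValueError("need one response per item and at least one item")
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, float] = defaultdict(float)
    for item, response in zip(items, responses):
        totals[item.dimension] += 1
        correct[item.dimension] += mcq_reward(item, response)
    per_dim = {dim: correct[dim] / totals[dim] for dim in totals}
    return per_dim, mean(per_dim.values())


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: z-score each reward within the group sampled for one question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    item = MCQItem(
        media="clip_0001.mp4",  # hypothetical file name
        question="What is the shot size of this clip?",
        options=("Extreme close-up", "Close-up", "Medium shot", "Wide shot"),
        answer="C",
        dimension="shot size",
    )
    # Four responses sampled for the same question, forming one GRPO group.
    group = ["C", "B) Close-up", "(C)", "D. Wide shot"]
    rewards = [mcq_reward(item, r) for r in group]
    print(rewards, [round(a, 3) for a in group_relative_advantages(rewards)])
    print(accuracy_by_dimension([item], ["C) Medium shot"]))
```

Because GRPO scores each response only relative to the other responses sampled for the same question, a group in which every answer is right (or every answer is wrong) yields zero advantages and contributes no learning signal; the `eps` term only keeps that case from dividing by zero.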
ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new **state-of-the-art** on ShotBench.

## Citation

If you find ShotBench useful for your research, please cite the following paper:

```bibtex
@misc{liu2025shotbench,
  title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
  author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
  year={2025},
  eprint={2506.21356},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.21356},
}
```


Provider:

maas

Created:

2025-07-07
