MLVU

Source: ModelScope community (魔搭社区) · Updated 2026-05-16 · Indexed 2024-11-02
Download link:
https://modelscope.cn/datasets/AI-ModelScope/MLVU
Resource description:
<h1 align="center">MLVU: Multi-task Long Video Understanding Benchmark</h1>

<p align="center">
<a href="https://arxiv.org/abs/2406.04264"> <img alt="Build" src="http://img.shields.io/badge/cs.CV-arXiv%3A2406.04264-B31B1B.svg"> </a>
<a href="https://huggingface.co/datasets/MLVU/MLVU_Test"> <img alt="Build" src="https://img.shields.io/badge/🤗 Dataset-MLVU Benchmark (Test)-yellow"> </a>
<a href="https://github.com/JUNJIE99/MLVU"> <img alt="Build" src="https://img.shields.io/badge/Github-MLVU: Multi task Long Video Understanding Benchmark-blue"> </a>
</p>

This repo contains the annotation data and evaluation code for the paper "[MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding](https://arxiv.org/abs/2406.04264)".

## 🔔 News:

- 🆕 7/28/2024: The data for the **MLVU-Test set** has been released ([🤗 Link](https://huggingface.co/datasets/MLVU/MLVU_Test))! The test set includes 11 different tasks, featuring our newly added Sports Question Answering (SQA, single-detail LVU) and Tutorial Question Answering (TQA, multi-detail LVU). The MLVU-Test has **expanded the number of options in multiple-choice questions to six**. The ground truth of the MLVU-Test will remain undisclosed, but everyone will be able to evaluate online (the online website is coming soon!). 🔥
- 🎉 7/28/2024: The **MLVU-Dev set** has now been integrated into [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)! You can now conveniently evaluate the multiple-choice questions (MLVU<sub>M</sub>) of the [MLVU-Dev set](https://huggingface.co/datasets/MLVU/MVLU) with a single click using lmms-eval. Thanks to the lmms-eval team! 🔥
- 🏆 7/25/2024: We have released the [MLVU-Test leaderboard](https://github.com/JUNJIE99/MLVU?tab=readme-ov-file#trophy-mlvu-test-leaderboard) and incorporated evaluation results for several recently launched models, such as [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA), [VILA](https://github.com/NVlabs/VILA), [ShareGPT4-Video](https://github.com/ShareGPT4Omni/ShareGPT4Video), etc. 🔥
- 🏠 6/19/2024: For better maintenance and updates of MLVU, we have migrated MLVU to this new repository. We will continue to update and maintain MLVU here. If you have any questions, feel free to raise an issue. 🔥
- 🥳 6/7/2024: We have released the MLVU [Benchmark](https://huggingface.co/datasets/MLVU/MVLU) and [Paper](https://arxiv.org/abs/2406.04264)! 🔥

## License

Our dataset is under the CC-BY-NC-SA-4.0 license.

⚠️ If you need to access and use our dataset, you must understand and agree: **This dataset is for research purposes only and cannot be used for any commercial or other purposes. The user assumes all effects arising from any other use and dissemination.**

We do not own the copyright of any raw video files. Currently, we provide video access to researchers under the condition of acknowledging the above license. For the video data used, we respect and acknowledge any copyrights of the video authors. Therefore, for the movies, TV series, documentaries, and cartoons used in the dataset, we have reduced the resolution, clipped the length, adjusted dimensions, etc. of the original videos to minimize the impact on the rights of the original works.

If the original authors of the related works still believe that the videos should be removed, please contact mlvubenchmark@gmail.com or directly raise an issue.

## Introduction

We introduce MLVU: the first comprehensive benchmark designed for evaluating Multimodal Large Language Models (MLLMs) in Long Video Understanding (LVU) tasks. MLVU is constructed from a wide variety of long videos, with lengths ranging from 3 minutes to 2 hours, and includes nine distinct evaluation tasks. These tasks require MLLMs to draw on both global and local information from the videos. Our evaluation of 20 popular MLLMs, including GPT-4o, reveals significant challenges in LVU, with even the top-performing GPT-4o achieving an average score of only 64.6% on multiple-choice tasks. In addition, our empirical results underscore the need for improvements in context length, image understanding, and strong LLM backbones. We anticipate that MLVU will serve as a catalyst for the community to further advance MLLMs' capabilities in understanding long videos.

![Statistical overview of our MLVU dataset. **Left:** Distribution of Video Duration; **Middle:** Distribution of Source Types for Long Videos; **Right:** Quantification of Each Task Type.](./figs/statistic.png)

## 🏆 Mini-Leaderboard

| Model | Input | M-Avg | G-Avg |
| --- | --- | --- | --- |
| Full mark | - | 100 | 10 |
| [GPT-4o](https://openai.com/index/hello-gpt-4o/) | 0.5 fps | 64.6 | 5.80 |
| [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) | 96 frm | 60.2 | 4.11 |
| [VILA-1.5](https://github.com/NVlabs/VILA) | 14 frm | 56.7 | 4.31 |
| [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA) | 256 frm | 56.3 | 4.33 |
| [InternVL-1.5](https://github.com/OpenGVLab/InternVL) | 16 frm | 50.4 | 4.02 |
| [GPT-4 Turbo](https://openai.com/index/gpt-4v-system-card/) | 16 frm | 49.2 | 5.35 |
| [VideoLLaMA2-Chat](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | 16 frm | 48.5 | 3.99 |
| [VideoChat2_HD](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) | 16 frm | 47.9 | 3.99 |
| [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) | 8 frm | 47.3 | 3.84 |
| [ShareGPT4Video](https://github.com/ShareGPT4Omni/ShareGPT4Video) | 16 frm | 46.4 | 3.77 |
| [VideoChat2-Vicuna](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) | 16 frm | 44.5 | 3.81 |
| [MiniGPT4-Video](https://github.com/Vision-CAIR/MiniGPT4-video) | 90 frm | 44.5 | 3.36 |
| [Qwen-VL-Max](https://github.com/QwenLM/Qwen) | 16 frm | 42.2 | 3.96 |
| [VTimeLLM](https://github.com/huangb23/VTimeLLM) | 100 frm | 41.9 | 3.94 |
| [LLaVA-1.6](https://github.com/haotian-liu/LLaVA) | 16 frm | 39.3 | 3.23 |
| [Claude-3-Opus](https://claude.ai/login?returnTo=%2F%3F) | 16 frm | 36.5 | 3.39 |
| [MA-LMM](https://github.com/boheumd/MA-LMM) | 1000 frm | 36.4 | 3.46 |
| [Video-LLaMA-2](https://github.com/DAMO-NLP-SG/Video-LLaMA) | 16 frm | 35.5 | 3.78 |
| [LLaMA-VID](https://github.com/dvlab-research/LLaMA-VID) | 1 fps | 33.2 | 4.22 |
| [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) | 100 frm | 31.3 | 3.90 |
| [TimeChat](https://github.com/RenShuhuai-Andy/TimeChat) | 96 frm | 30.9 | 3.42 |
| [VideoChat](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat) | 16 frm | 29.2 | 3.66 |
| [Movie-LLM](https://github.com/Deaddawn/MovieLLM-code) | 1 fps | 26.1 | 3.94 |
| [mPLUG-Owl-V](https://github.com/X-PLUG/mPLUG-Owl) | 16 frm | 25.9 | 3.84 |
| [MovieChat](https://github.com/rese1f/MovieChat) | 2048 frm | 25.8 | 2.78 |
| [Otter-V](https://github.com/Luodian/Otter) | 16 frm | 24.4 | 3.31 |
| [Otter-I](https://github.com/Luodian/Otter) | 16 frm | 23.3 | 3.15 |

## MLVU Benchmark

> Before you access our dataset, we kindly ask you to thoroughly read and understand the license outlined above. If you cannot agree to these terms, we request that you refrain from downloading our video data.

The annotation file is readily accessible [here](https://github.com/FlagOpen/FlagEmbedding/tree/master/MLVU/data). For the raw videos, you can access them via this [<u>🤗 HF Link</u>](https://huggingface.co/datasets/MLVU/MVLU).

MLVU encompasses nine distinct tasks, which include multiple-choice tasks as well as free-form generation tasks. These tasks are specifically tailored for long-form video understanding, and are classified into three categories: holistic understanding, single-detail understanding, and multi-detail understanding. Examples of the tasks are displayed below.

![Task Examples of our MLVU.](./figs/task_example.png)

## Evaluation

Please refer to our [evaluation](https://github.com/JUNJIE99/MLVU/tree/main/evaluation) and [evaluation_test](https://github.com/JUNJIE99/MLVU/tree/main/evaluation_test) folders for more details.

## Hosting and Maintenance

The annotation files will be permanently retained.

If some videos are requested to be removed, we will replace them with a set of video frames sparsely sampled from the video and adjusted in resolution. Since **all the questions in MLVU are only related to visual content** and do not involve audio, this will not significantly affect the validity of MLVU (most existing MLLMs also understand videos by frame extraction).

If even retaining the frame set is not allowed, we will still keep the relevant annotation files, and replace the frames with the meta-information of the video, or actively seek more reliable and risk-free video sources.

## Citation

If you find this repository useful, please consider giving a star 🌟 and citation

```
@article{MLVU,
  title={MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding},
  author={Zhou, Junjie and Shu, Yan and Zhao, Bo and Wu, Boya and Xiao, Shitao and Yang, Xi and Xiong, Yongping and Zhang, Bo and Huang, Tiejun and Liu, Zheng},
  journal={arXiv preprint arXiv:2406.04264},
  year={2024}
}
```
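The README does not spell out how the M-Avg column of the mini-leaderboard is computed, so the following is only a minimal sketch. It assumes M-Avg is the unweighted mean of per-task accuracies (in percent) over the multiple-choice tasks. The record schema (`task`, `prediction`, `answer`) and the task labels are hypothetical, and the sample records are toy values, not benchmark results. The official scripts live in the `evaluation` folder linked above.

```python
from collections import defaultdict


def multiple_choice_scores(records):
    """Return (per-task accuracy in percent, unweighted mean over tasks).

    `records` is an iterable of dicts with the hypothetical keys
    'task', 'prediction', and 'answer' (option letters such as 'A').
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1

    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    # Assumption: every task counts equally, regardless of its question count.
    mean_accuracy = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, mean_accuracy


# Toy records for illustration only.
toy_records = [
    {"task": "task_a", "prediction": "A", "answer": "A"},
    {"task": "task_a", "prediction": "b ", "answer": "B"},
    {"task": "task_b", "prediction": "C", "answer": "C"},
    {"task": "task_b", "prediction": "D", "answer": "A"},
]
per_task, m_avg = multiple_choice_scores(toy_records)
print(per_task, m_avg)
```

Averaging per task rather than per question keeps a task with many questions from dominating the score; if the official metric weights questions instead, the final line of the function would need to change accordingly.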

Provider: maas
Created: 2024-10-29
Background overview
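MLVU is a comprehensive benchmark dataset for evaluating multimodal large language models on long video understanding. It contains 9 distinct tasks, with videos ranging from 3 minutes to 2 hours, covering holistic, single-detail, and multi-detail understanding. The dataset is 429.58 GB in size and is restricted to research use. It aims to advance models' long video understanding capabilities; current evaluations show that even top models such as GPT-4o face significant challenges.
The summary above was collected and generated by 遇见数据集.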
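Because the videos span 3 minutes to 2 hours, the frame budget a model uses (the "Input" column of the leaderboard, e.g. `16 frm` or `0.5 fps`) determines how sparsely a video is seen. The source does not describe how any model samples its frames; the sketch below only illustrates two common strategies, uniform sampling of a fixed number of frames and sampling at a fixed rate. The function names and parameters are hypothetical.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` indices spread evenly over the video (segment midpoints)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)  # cannot sample more frames than exist
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick indices at a fixed rate, e.g. target_fps=0.5 means one frame every 2 seconds."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    step = max(native_fps / target_fps, 1.0)  # never step by less than one frame
    indices = []
    position = 0.0
    while position < total_frames:
        indices.append(int(position))
        position += step
    return indices


# A 10-second clip at 30 fps has 300 frames.
print(uniform_frame_indices(300, 4))
print(fps_frame_indices(300, 30.0, 0.5))
```

With a fixed frame count, the gap between sampled frames grows with video length: 16 frames over a 2-hour video leaves 7.5 minutes between consecutive samples, whereas a fixed rate keeps the gap constant but makes the number of frames grow with duration.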