MMVU

Name: MMVU
Creator: maas
Published: 2026-05-22 02:15:17
License: No description available

ModelScope community · Updated 2026-05-22 · Indexed 2025-02-01

Download link:

https://modelscope.cn/datasets/AI-ModelScope/MMVU


Resource description:

<h1 align="center"> MMVU: Measuring Expert-Level Multi-Discipline Video Understanding </h1>

<p align="center">
  <a href="https://mmvu-benchmark.github.io/">🌐 Homepage</a> •
  <a href="https://mmvu-benchmark.github.io/#leaderboard">🥇 Leaderboard</a> •
  <a href="https://huggingface.co/papers/2501.12380">📖 Paper</a> •
  <a href="https://huggingface.co/datasets/yale-nlp/MMVU">🤗 Data</a>
</p>

## 📰 News

- **2025-01-21**: We are excited to release the MMVU paper, dataset, and evaluation code!

## 👋 Overview

![Local Image](./assets/overview.png)

### Why MMVU Benchmark?

Despite the rapid progress of foundation models in both text-based and image-based expert reasoning, there is a clear gap in evaluating these models' capabilities in **specialized-domain video** understanding. Videos inherently capture **temporal dynamics**, **procedural knowledge**, and **complex interactions**, all of which are crucial for expert-level tasks across disciplines like healthcare, engineering, and scientific research. Unlike static images or text, specialized-domain videos often require integrating **domain-specific expertise** (e.g., understanding chemical reactions, medical procedures, or engineering workflows) alongside traditional **visual perception**.

MMVU is designed to **bridge this gap** and offer a **multidisciplinary** perspective by providing:

- **3,000 expert-annotated QA examples** spanning **1,529 specialized-domain videos** across **27 subjects** in **four key disciplines** (Science, Healthcare, Humanities & Social Sciences, and Engineering).
- Coverage that ensures both **breadth of domain knowledge** and **depth of reasoning**, reflecting real-world complexities in specialized fields.
- **Expert-annotated reasoning rationales** and **relevant domain knowledge** for each example, enabling researchers to assess not just **answer correctness** but also **reasoning quality**.

## 🚀 Quickstart

### 1. Setup

Install the required packages and set up the `.env` file:

```bash
pip install -r requirements.txt
```

**Dataset Example Feature**:

```json
{
    "id": // Unique ID for the question
    "video": // HF download link to the video
    "youtube_url": // original Youtube URL to the video
    "question_type": // "open-ended" or "multiple-choice"
    "metadata": {
        "subject": // Subject of the example
        "textbook": // From which textbook the example is curated from
        "rationale": // rationale for the answer (Coming Soon!)
        "knowledge": // List of wikipedia URLs for the domain knowledge (Coming Soon!)
    },
    "question": // The question
    "choices": // choices for multiple-choice questions
    "answer": // answer to the question
},
```

### 2. Response Generation

As detailed in Appendix B.1 of the paper, we evaluate models using three different types of model inference: API-based, vLLM, and HuggingFace, depending on the specific model's availability. To generate responses for the MMVU validation set, run the following commands:

```bash
bash model_inference_scripts/run_api_models.sh # Run all API models
bash model_inference_scripts/run_hf_models.sh # Run model inference using HuggingFace
bash model_inference_scripts/run_vllm_image_models.sh # Run model that supports multi-image input using vllm
bash model_inference_scripts/run_vllm_video_models.sh # Run model that supports video input using vllm
```

The generated responses will be saved in the `outputs/validation_{prompt}` directory, where `{prompt}` is `cot` for CoT reasoning and `direct-output` for direct answering without intermediate reasoning steps.

### 3. Evaluation

To evaluate the generated responses, run the following command:

```bash
python acc_evaluation.py --output_dir <output_dir>
```

The evaluation results will be saved in the `outputs/evaluation_results/` directory.
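To make the example schema and the `outputs/validation_{prompt}` naming convention above more concrete, here is a minimal Python sketch. It is not part of the MMVU codebase: the helper names `output_dir_for` and `build_question_text` are hypothetical, the rendered text is not the official prompt, and the sketch assumes that `choices` is a mapping from option letters to option text.

```python
from pathlib import Path

# Prompt modes named in the README; responses go to outputs/validation_{prompt}.
PROMPT_MODES = ("cot", "direct-output")


def output_dir_for(prompt_mode: str, root: str = "outputs") -> Path:
    """Return the directory where responses for a prompt mode are expected."""
    if prompt_mode not in PROMPT_MODES:
        raise ValueError(f"unknown prompt mode: {prompt_mode!r}")
    return Path(root) / f"validation_{prompt_mode}"


def build_question_text(example: dict) -> str:
    """Render one example as plain text (hypothetical format, not the official prompt)."""
    lines = [example["question"]]
    if example["question_type"] == "multiple-choice":
        # Assumption: `choices` maps option letters to option text.
        for letter, text in sorted(example["choices"].items()):
            lines.append(f"{letter}. {text}")
    return "\n".join(lines)
```

For an open-ended example, `build_question_text` returns only the question, since there are no options to list.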
## 📋 Results from Existing Models

We release full results on the validation set (i.e., generated responses, accuracy measurement done by GPT-4o) for all models we tested in our [HuggingFace Repo (Coming Soon!)](https://huggingface.co/datasets/yale-nlp/MMVU_model_outputs). If you are interested in doing fine-grained analysis on these results, feel free to use them!

## 🥇 Leaderboard Submission

The MMVU test set remains hidden from the public to minimize data contamination and ensure an unbiased evaluation of model capabilities. We are developing an online submission system for the leaderboard. In the meantime, if you would like to evaluate your model or method on the MMVU test set before the submission system becomes available, please reach out to Yilun Zhao at yilun.zhao@yale.edu and share the codebase you used to generate results on the validation set. We will run your model on the test set and provide you with the evaluation results. You can then decide whether to add your results to the leaderboard.

## ✍️ Citation

If you use or are inspired by our work, please consider citing us (available soon):

```
@misc{zhao2025mmvu,
      title={MMVU: Measuring Expert-Level Multi-Discipline Video Understanding},
      author={Yilun Zhao and Lujing Xie and Haowei Zhang and Guo Gan and Yitao Long and Zhiyuan Hu and Tongyan Hu and Weiyuan Chen and Chuhan Li and Junyang Song and Zhijian Xu and Chengye Wang and Weifeng Pan and Ziyao Shangguan and Xiangru Tang and Zhenwen Liang and Yixin Liu and Chen Zhao and Arman Cohan},
      year={2025},
      eprint={2501.12380},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.12380},
}
```
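As one possible starting point for the fine-grained analysis of validation results mentioned above, the sketch below groups judged records by `metadata.subject` and computes per-subject accuracy. It is an illustration under assumptions, not the project's evaluation code: the function name `accuracy_by_subject` and the boolean `correct` field are hypothetical, and the layout of the released output files may differ.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Group judged records by subject and return {subject: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        subject = rec["metadata"]["subject"]
        totals[subject] += 1
        # Assumption: each record carries a boolean `correct` flag set by the judge.
        correct[subject] += int(bool(rec["correct"]))
    return {subject: correct[subject] / totals[subject] for subject in totals}


# Made-up records for illustration only; these are not real MMVU results.
sample = [
    {"metadata": {"subject": "Chemistry"}, "correct": True},
    {"metadata": {"subject": "Chemistry"}, "correct": False},
    {"metadata": {"subject": "Biology"}, "correct": True},
]
per_subject = accuracy_by_subject(sample)
```

Keeping the grouping key as a plain dictionary lookup makes it straightforward to switch to another field, such as `question_type`, if that breakdown is of more interest.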


Provider:

maas

Created:

2025-01-23
