MMMU_Pro
ModelScope community. Updated 2026-05-17. Indexed 2025-03-01.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/MMMU_Pro
Resource description:
# MMMU-Pro (A More Robust Multi-discipline Multimodal Understanding Benchmark)
[**🌐 Homepage**](https://mmmu-benchmark.github.io/) | [**🏆 Leaderboard**](https://mmmu-benchmark.github.io/#leaderboard) | [**🤗 Dataset**](https://huggingface.co/datasets/MMMU/MMMU_Pro) | [**🤗 Paper**](https://huggingface.co/papers/2409.02813) | [**📖 arXiv**](https://arxiv.org/abs/2409.02813) | [**GitHub**](https://github.com/MMMU-Benchmark/MMMU)
## 🔔News
- **🛠️🛠️ [2025-03-08] Fixed mismatch between inner image labels and shuffled options in Vision and Standard (10 options) settings. (test_Chemistry_5,94,147,216,314,345,354,461,560,570; test_Materials_450; test_Pharmacy_198; validation_Chemistry_12,26,29; validation_Materials_10,28; validation_Psychology_1)**
- **🛠️[2024-11-10] Added options to the Vision subset.**
- **🛠️[2024-10-20] Uploaded Standard (4 options) cases.**
- **🔥[2024-09-05] Introducing [MMMU-Pro](https://arxiv.org/abs/2409.02813), a robust version of MMMU benchmark for multimodal AI evaluation! 🚀**
# Introduction
MMMU-Pro is an enhanced multimodal benchmark designed to rigorously assess the true understanding capabilities of advanced AI models across multiple modalities. It builds upon the original MMMU benchmark by introducing several key improvements that make it more challenging and realistic, ensuring that models are evaluated on their genuine ability to integrate and comprehend both visual and textual information.

## Key Features
- **Multimodal Understanding:** The dataset includes a diverse set of questions that require models to interpret and integrate both visual and textual information, reflecting real-world scenarios where users often interact with embedded content.
- **Increased Complexity:** MMMU-Pro introduces a vision-only input setting and increases the number of candidate options from 4 to 10, making it significantly harder for models to rely on guessing or exploiting shortcuts.
- **Real-World Simulation:** The vision-only questions are derived from screenshots or photos captured within a simulated display environment. These variations include different backgrounds, font styles, and sizes, closely mimicking real-world conditions where users might provide integrated visual-textual content.
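
To put a number on the guessing argument, here is a small sketch of the chance-level baseline. It assumes uniform random guessing and exactly one correct option per question; both are simplifying assumptions, not claims made by the benchmark authors.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of a uniform random guess with one correct option."""
    if num_options < 1:
        raise ValueError("num_options must be positive")
    return Fraction(1, num_options)


# Compare the original 4-option setting with the 10-option setting.
for k in (4, 10):
    print(f"{k} options: {float(chance_accuracy(k)):.0%} expected by chance")
```

Under these assumptions, moving from 4 to 10 options lowers the floor that blind guessing can reach from one in four to one in ten.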
# Dataset Details
The dataset is organized into two subsets. The Standard subset is available in both a 10-option and a 4-option configuration.
- **Standard:** This subset increases the number of candidate answers to 10, making it more challenging for models to guess the correct answer.
  - `id`: Unique identifier for each question.
  - `question`: The textual question that needs to be answered.
  - `options`: A list of possible answers for the question (10 in the 10-option configuration, 4 in the 4-option configuration).
  - `explanation`: A detailed explanation of the correct answer, useful for understanding the reasoning behind it.
  - `image_[num]`: Associated images relevant to the question, where `[num]` is a placeholder for image numbering (e.g., `image_1`, `image_2`).
  - `image_type`: Describes the type of images included (e.g., chart, diagram, map).
  - `answer`: The correct answer from the list of options.
  - `topic_difficulty`: A measure of the difficulty of the topic.
  - `subject`: The academic subject or field to which the question belongs.
- **Vision:** In this subset, questions are embedded within screenshots or photos, and models must integrate visual and textual information to answer correctly. No separate text is fed into the model.
  - `id`: Unique identifier for each question.
  - `image`: The image containing both the question and information needed to answer it.
  - `answer`: The correct answer to the question.
  - `subject`: The academic subject or field to which the question belongs.
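
As an illustration of how the Standard fields might be consumed, the sketch below turns one record into a lettered multiple-choice prompt and gathers its `image_[num]` values. The helper names (`parse_options`, `build_prompt`, `collect_images`) and the demo record are hypothetical. The code also assumes that `options` may arrive either as a list or as the string form of a list, and that unused image slots are `None`; check both against the data you actually load.

```python
import ast
import string


def parse_options(raw):
    """Return options as a list; accepts a list or its string form (assumed)."""
    if isinstance(raw, str):
        return list(ast.literal_eval(raw))
    return list(raw)


def build_prompt(record):
    """Render a Standard-style record as a lettered multiple-choice prompt."""
    options = parse_options(record["options"])
    lines = [record["question"]]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def collect_images(record):
    """Gather non-empty image_1, image_2, ... values in numeric order."""
    keys = [k for k in record if k.startswith("image_") and k[6:].isdigit()]
    keys.sort(key=lambda k: int(k[6:]))
    return [record[k] for k in keys if record[k] is not None]


# Hypothetical record that only mirrors the field names listed above.
example = {
    "id": "demo_0",
    "question": "Which curve in the figure is linear?",
    "options": "['Curve 1', 'Curve 2', 'Curve 3']",
    "image_1": "<image placeholder>",
    "image_2": None,
    "image_type": "chart",
}

print(build_prompt(example))
print(collect_images(example))
```

Note that `image_type` is skipped by `collect_images` because its suffix is not a number.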
## Usage
```python
from datasets import load_dataset

# Each call selects one configuration of the MMMU/MMMU_Pro repository.
mmmu_pro_vision = load_dataset("MMMU/MMMU_Pro", "vision")
mmmu_pro_standard_4 = load_dataset("MMMU/MMMU_Pro", "standard (4 options)")
mmmu_pro_standard_10 = load_dataset("MMMU/MMMU_Pro", "standard (10 options)")
```
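
Once a configuration is loaded, a quick sanity check is to tally records per `subject`. The sketch below works on any iterable of dict-like records, so it runs here on stand-in rows; `count_by_subject` and the demo rows are hypothetical, and in practice you would pass whichever split you select from the object returned by `load_dataset`.

```python
from collections import Counter


def count_by_subject(records):
    """Count how many records fall under each `subject` value."""
    return Counter(record["subject"] for record in records)


# Stand-in rows; replace with a split of the loaded dataset.
demo_rows = [
    {"id": "demo_1", "subject": "Chemistry"},
    {"id": "demo_2", "subject": "Chemistry"},
    {"id": "demo_3", "subject": "Psychology"},
]

for subject, n in count_by_subject(demo_rows).most_common():
    print(subject, n)
```

Iterating a full split this way also decodes its images, so on the real data it may be faster to read only the `subject` column.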
# Methods
- **Filtering Questions:** Initially, questions answerable by text-only models were filtered out. Four strong open-source LLMs were tasked with answering the MMMU questions without images. Questions consistently answered correctly were excluded, resulting in a refined dataset.
- **Augmenting Candidate Options:** To reduce the reliance on option-based guessing, the number of candidate answers was increased from four to ten, making the task significantly more complex.
- **Enhancing Evaluation with Vision-Only Input Setting:** To further challenge models, a vision-only input setting was introduced. Questions are embedded in screenshots or photos, demanding integration of visual and textual information without separate text input.
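
The filtering step can be sketched as follows. The text above only says that questions "consistently answered correctly" by text-only LLMs were excluded, so the two thresholds here (`min_correct` and `max_models_answering`), the function names, and the demo data are all assumptions chosen for illustration, not the authors' exact criteria.

```python
def is_answerable(trial_results, min_correct):
    """A model 'answers' a question if it is right in at least min_correct trials."""
    return sum(trial_results) >= min_correct


def keep_question(results_by_model, min_correct, max_models_answering):
    """Keep a question unless too many text-only models can answer it."""
    answering = sum(
        is_answerable(trials, min_correct) for trials in results_by_model.values()
    )
    return answering <= max_models_answering


# Hypothetical text-only outcomes: 4 models x 5 trials per question.
text_only_results = {
    "q1": {
        "llm_a": [True, True, True, True, True],
        "llm_b": [True, True, True, True, False],
        "llm_c": [True, True, True, True, True],
        "llm_d": [True, True, True, False, True],
    },
    "q2": {
        "llm_a": [False, False, True, False, False],
        "llm_b": [False, False, False, False, False],
        "llm_c": [True, False, False, False, False],
        "llm_d": [False, False, False, False, False],
    },
}

kept = [
    qid
    for qid, by_model in text_only_results.items()
    if keep_question(by_model, min_correct=4, max_models_answering=2)
]
print(kept)
```

With these illustrative thresholds, `q1` is dropped because all four models answer it without the image, while `q2` is retained.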
# Overall Results
- **Comparison with MMMU:** The combined challenges of additional candidate options and vision-only input resulted in a substantial performance decrease from the original MMMU. For example, GPT-4o (0513) falls from 69.1 on MMMU (Val) to 51.9 on MMMU-Pro.

| Model                 | MMMU-Pro | MMMU (Val) |
|-----------------------|----------|------------|
| GPT-4o (0513)         | 51.9     | 69.1       |
| Claude 3.5 Sonnet     | 51.5     | 68.3       |
| Gemini 1.5 Pro (0801) | 46.9     | 65.8       |
| Gemini 1.5 Pro (0523) | 43.5     | 62.2       |
| InternVL2-Llama3-76B  | 40.0     | 58.3       |
| GPT-4o mini           | 37.6     | 59.4       |
| InternVL2-40B         | 34.2     | 55.2       |
| LLaVA-OneVision-72B   | 31.0     | 56.8       |
| InternVL2-8B          | 29.0     | 51.2       |
| MiniCPM-V 2.6         | 27.2     | 49.8       |
| VILA-1.5-40B          | 25.0     | 51.9       |
| LLaVA-NeXT-72B        | 25.1     | 49.9       |
| LLaVA-OneVision-7B    | 24.1     | 48.8       |
| LLaVA-NeXT-34B        | 23.8     | 48.1       |
| Idefics3-8B-Llama3    | 22.9     | 46.6       |
| Phi-3.5-Vision        | 19.7     | 43.0       |
| LLaVA-NeXT-7B         | 17.0     | 35.3       |
| LLaVA-NeXT-13B        | 17.2     | 36.2       |

*Table 1: Overall results of different models on MMMU-Pro and MMMU (Val).*
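
The size of the drop can be read off Table 1 directly. The sketch below computes the percentage-point difference for a subset of rows copied from the table; the helper name `absolute_drop` is hypothetical, and the subset is chosen only to keep the example short.

```python
# (MMMU-Pro, MMMU Val) scores copied from a subset of Table 1.
scores = {
    "GPT-4o (0513)": (51.9, 69.1),
    "Claude 3.5 Sonnet": (51.5, 68.3),
    "InternVL2-Llama3-76B": (40.0, 58.3),
    "LLaVA-OneVision-72B": (31.0, 56.8),
    "VILA-1.5-40B": (25.0, 51.9),
}


def absolute_drop(pro, val):
    """Percentage-point drop from MMMU (Val) to MMMU-Pro, rounded to 0.1."""
    return round(val - pro, 1)


drops = {model: absolute_drop(pro, val) for model, (pro, val) in scores.items()}

# List models from the smallest drop to the largest.
for model, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{model:<22} -{drop:.1f} points")
```

Within this subset the drops range from 16.8 points (Claude 3.5 Sonnet) to 26.9 points (VILA-1.5-40B).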
## Disclaimers
The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution.
Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to [contact](#contact) us. Upon verification, such samples will be promptly removed.
## Contact
- Xiang Yue: xiangyue.work@gmail.com
# Citation
**BibTeX:**
```bibtex
@article{yue2024mmmu,
title={MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark},
author={Xiang Yue and Tianyu Zheng and Yuansheng Ni and Yubo Wang and Kai Zhang and Shengbang Tong and Yuxuan Sun and Botao Yu and Ge Zhang and Huan Sun and Yu Su and Wenhu Chen and Graham Neubig},
journal={arXiv preprint arXiv:2409.02813},
year={2024}
}
```
Provider:
maas
Created:
2025-02-23



