Download link:

https://modelscope.cn/datasets/AI-ModelScope/MMMU

Resource description:

# MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI)

[**🌐 Homepage**](https://mmmu-benchmark.github.io/) | [**🏆 Leaderboard**](https://mmmu-benchmark.github.io/#leaderboard) | [**🤗 Dataset**](https://huggingface.co/datasets/MMMU/MMMU/) | [**🤗 Paper**](https://huggingface.co/papers/2311.16502) | [**📖 arXiv**](https://arxiv.org/abs/2311.16502) | [**GitHub**](https://github.com/MMMU-Benchmark/MMMU)

## 🔔News

- **‼️[2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally! 🎉**
- **🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25; test_Materials_17, 242) and content error in validation_Materials_25.**
- **🛠️[2024-04-30]: Fixed missing "-" or "^" signs in Math dataset items (dev_Math_2, validation_Math_11, 12, 16; test_Math_8, 23, 43, 113, 164, 223, 236, 287, 329, 402, 498) and corrected option errors in validation_Math_2. If you encounter any issues with the dataset, please contact us promptly!**
- **🚀[2024-01-31]: We added Human Expert performance on the [Leaderboard](https://mmmu-benchmark.github.io/#leaderboard)!🌟**
- **🔥[2023-12-04]: ~~Our evaluation server for the test set is now available on [EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview).~~ We welcome all submissions and look forward to your participation! 😆**

## Dataset Details

### Dataset Description

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes **11.5K meticulously collected multimodal questions** from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span **30 subjects** and **183 subfields**, comprising **30 highly heterogeneous image types**, such as charts, diagrams, maps, tables, music sheets, and chemical structures. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence (AGI).

🎯 **We have released a full set comprising 150 development samples, 900 validation samples and 10,500 test samples.** The development set is used for few-shot/in-context learning, and the validation set is used for debugging models, selecting hyperparameters, or quick evaluations.

~~The answers and explanations for the test set questions are withheld. You can submit your model's predictions for the **test set** on **[EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview)**.~~ The answers and explanations for the test set samples are now released. You can evaluate your models locally!

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/2Ulh9yznm1dvISV4xJ_Ok.png)

### Dataset Creation

MMMU was created to challenge multimodal models with tasks that demand college-level subject knowledge and deliberate reasoning, pushing the boundaries of what these models can achieve in terms of expert-level perception and reasoning. The data for the MMMU dataset was manually collected by a team of college students from various disciplines, using online sources, textbooks, and lecture materials.

- **Content:** The dataset contains 11.5K college-level problems across six broad disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 30 college subjects.
- **Image Types:** The dataset includes 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures, interleaved with text.
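Since the test answers are released and models can be scored locally, the bookkeeping can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: it assumes gold answers and predictions are plain dictionaries keyed by sample id, and that ids follow the `<split>_<Subject>_<number>` pattern seen in the news items above (for example `validation_Math_11`). The function names `subject_of` and `accuracy_by_subject` are hypothetical.

```python
from collections import defaultdict

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"dev": 150, "validation": 900, "test": 10_500}


def subject_of(sample_id: str) -> str:
    """Return the subject part of an id such as 'validation_Math_11'.

    Assumption: the split name precedes the first underscore and the
    running number follows the last one.
    """
    _split, rest = sample_id.split("_", 1)
    subject, _number = rest.rsplit("_", 1)
    return subject


def accuracy_by_subject(gold: dict, predictions: dict) -> dict:
    """Exact-match accuracy per subject; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample_id, answer in gold.items():
        subject = subject_of(sample_id)
        total[subject] += 1
        if predictions.get(sample_id) == answer:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


if __name__ == "__main__":
    # Toy gold answers and predictions, invented for illustration only.
    gold = {
        "validation_Math_11": "A",
        "validation_Math_12": "B",
        "validation_Materials_25": "C",
    }
    predictions = {
        "validation_Math_11": "A",
        "validation_Math_12": "C",
        "validation_Materials_25": "C",
    }
    print(sum(SPLIT_SIZES.values()))  # total number of released samples
    print(accuracy_by_subject(gold, predictions))
```

The three split sizes sum to 11,550, which is the figure the card rounds to 11.5K. Exact string matching is only adequate for multiple-choice letters; free-form answers would need a more forgiving comparison.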
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/Mbf8O5lEH8I8czprch0AG.png)

## 🏆 Mini-Leaderboard

We show a mini-leaderboard here; please find more information in our paper or on the [**homepage**](https://mmmu-benchmark.github.io/).

| Model                          | Val (900) | Test (10.5K) |
|--------------------------------|:---------:|:------------:|
| Expert (Best)                  | 88.6      | -            |
| Expert (Medium)                | 82.6      | -            |
| Expert (Worst)                 | 76.2      | -            |
| GPT-4o*                        | **69.1**  | -            |
| Gemini 1.5 Pro*                | 62.2      | -            |
| InternVL2-Pro*                 | 62.0      | **55.7**     |
| Gemini 1.0 Ultra*              | 59.4      | -            |
| Claude 3 Opus*                 | 59.4      | -            |
| GPT-4V(ision) (Playground)     | 56.8      | **55.7**     |
| Reka Core*                     | 56.3      | -            |
| Gemini 1.5 Flash*              | 56.1      | -            |
| SenseChat-Vision-0423-Preview* | 54.6      | 50.3         |
| Reka Flash*                    | 53.3      | -            |
| Claude 3 Sonnet*               | 53.1      | -            |
| HPT Pro*                       | 52.0      | -            |
| VILA1.5*                       | 51.9      | 46.9         |
| Qwen-VL-MAX*                   | 51.4      | 46.8         |
| InternVL-Chat-V1.2*            | 51.6      | 46.2         |
| Skywork-VL*                    | 51.4      | 46.2         |
| LLaVA-1.6-34B*                 | 51.1      | 44.7         |
| Claude 3 Haiku*                | 50.2      | -            |
| Adept Fuyu-Heavy*              | 48.3      | -            |
| Gemini 1.0 Pro*                | 47.9      | -            |
| Marco-VL-Plus*                 | 46.2      | 44.3         |
| Yi-VL-34B*                     | 45.9      | 41.6         |
| Qwen-VL-PLUS*                  | 45.2      | 40.8         |
| HPT Air*                       | 44.0      | -            |
| Reka Edge*                     | 42.8      | -            |
| Marco-VL*                      | 41.2      | 40.4         |
| OmniLMM-12B*                   | 41.1      | 40.4         |
| Bunny-8B*                      | 43.3      | 39.0         |
| Bunny-4B*                      | 41.4      | 38.4         |
| Weitu-VL-1.0-15B*              | -         | 38.4         |
| InternLM-XComposer2-VL*        | 43.0      | 38.2         |
| Yi-VL-6B*                      | 39.1      | 37.8         |
| InfiMM-Zephyr-7B*              | 39.4      | 35.5         |
| InternVL-Chat-V1.1*            | 39.1      | 35.3         |
| Math-LLaVA-13B*                | 38.3      | 34.6         |
| SVIT*                          | 38.0      | 34.1         |
| MiniCPM-V*                     | 37.2      | 34.1         |
| MiniCPM-V-2*                   | 37.1      | -            |
| Emu2-Chat*                     | 36.3      | 34.1         |
| BLIP-2 FLAN-T5-XXL             | 35.4      | 34.0         |
| InstructBLIP-T5-XXL            | 35.7      | 33.8         |
| LLaVA-1.5-13B                  | 36.4      | 33.6         |
| Bunny-3B*                      | 38.2      | 33.0         |
| Qwen-VL-7B-Chat                | 35.9      | 32.9         |
| SPHINX*                        | 32.9      | 32.9         |
| mPLUG-OWL2*                    | 32.7      | 32.1         |
| BLIP-2 FLAN-T5-XL              | 34.4      | 31.0         |
| InstructBLIP-T5-XL             | 32.9      | 30.6         |
| Gemini Nano2*                  | 32.6      | -            |
| CogVLM                         | 32.1      | 30.1         |
| Otter                          | 32.2      | 29.1         |
| LLaMA-Adapter2-7B              | 29.8      | 27.7         |
| MiniGPT4-Vicuna-13B            | 26.8      | 27.6         |
| Adept Fuyu-8B                  | 27.9      | 27.4         |
| Kosmos2                        | 24.4      | 26.6         |
| OpenFlamingo2-9B               | 28.7      | 26.3         |
| Frequent Choice                | 22.1      | 23.9         |
| Random Choice                  | 26.8      | 25.8         |

*: results provided by the authors.

## Limitations

Despite its comprehensive nature, MMMU, like any benchmark, is not without limitations. The manual curation process, albeit thorough, may carry biases, and the focus on college-level subjects might not be a fully sufficient test for Expert AGI. However, we believe it should be necessary for an Expert AGI to achieve strong performance on MMMU to demonstrate its broad and deep subject knowledge as well as expert-level understanding and reasoning capabilities. In future work, we plan to incorporate human evaluations into MMMU. This will provide a more grounded comparison between model capabilities and expert performance, shedding light on the proximity of current AI systems to achieving Expert AGI.

## Disclaimers

The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to notify us. Upon verification, such samples will be promptly removed.
## Contact

- Xiang Yue: xiangyue.work@gmail.com
- Yu Su: su.809@osu.edu
- Wenhu Chen: wenhuchen@uwaterloo.ca

## Citation

**BibTeX:**

```bibtex
@inproceedings{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  booktitle={Proceedings of CVPR},
  year={2024},
}
```

Application scenarios: