下载链接：

https://modelscope.cn/datasets/yuexi0316/beifen

下载链接

链接失效反馈

官方服务：

资源简介：

# MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI) [**🌐 Homepage**](https://mmmu-benchmark.github.io/) | [**🏆 Leaderboard**](https://mmmu-benchmark.github.io/#leaderboard) | [**🤗 Dataset**](https://huggingface.co/datasets/MMMU/MMMU/) | [**🤗 Paper**](https://huggingface.co/papers/2311.16502) | [**📖 arXiv**](https://arxiv.org/abs/2311.16502) | [**GitHub**](https://github.com/MMMU-Benchmark/MMMU) ## 🔔News - **🛠️[2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25; test_Materials_17, 242) and content error in validation_Materials_25.** - **🛠️[2024-04-30]: Fixed missing "-" or "^" signs in Math dataset items (dev_Math_2, validation_Math_11, 12, 16; test_Math_8, 23, 43, 113, 164, 223, 236, 287, 329, 402, 498) and corrected option errors in validation_Math_2. If you encounter any issues with the dataset, please contact us promptly!** - **🚀[2024-01-31]: We added Human Expert performance on the [Leaderboard](https://mmmu-benchmark.github.io/#leaderboard)!🌟** - **🔥[2023-12-04]: Our evaluation server for test set is now availble on [EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview). We welcome all submissions and look forward to your participation! 😆** ## Dataset Details ### Dataset Description We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes **11.5K meticulously collected multimodal questions** from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span **30 subjects** and **183 subfields**, comprising **30 highly heterogeneous image types**, such as charts, diagrams, maps, tables, music sheets, and chemical structures. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence (AGI). 🎯 **We have released a full set comprising 150 development samples and 900 validation samples. We have released 10,500 test questions without their answers.** The development set is used for few-shot/in-context learning, and the validation set is used for debugging models, selecting hyperparameters, or quick evaluations. The answers and explanations for the test set questions are withheld. You can submit your model's predictions for the **test set** on **[EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview)**. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/2Ulh9yznm1dvISV4xJ_Ok.png) ### Dataset Creation MMMU was created to challenge multimodal models with tasks that demand college-level subject knowledge and deliberate reasoning, pushing the boundaries of what these models can achieve in terms of expert-level perception and reasoning. The data for the MMMU dataset was manually collected by a team of college students from various disciplines, using online sources, textbooks, and lecture materials. - **Content:** The dataset contains 11.5K college-level problems across six broad disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 30 college subjects. - **Image Types:** The dataset includes 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures, interleaved with text. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/Mbf8O5lEH8I8czprch0AG.png) ## 🏆 Mini-Leaderboard We show a mini-leaderboard here and please find more information in our paper or [**homepage**](https://mmmu-benchmark.github.io/). | Model | Val (900) | Test (10.5K) | |--------------------------------|:---------:|:------------:| | Expert (Best) | 88.6 | - | | Expert (Medium) | 82.6 | - | | Expert (Worst) | 76.2 | - | | GPT-4o* | **69.1** | - | | Gemini 1.5 Pro* | 62.2 | - | | InternVL2-Pro* | 62.0 | **55.7** | | Gemini 1.0 Ultra* | 59.4 | - | | Claude 3 Opus* | 59.4 | - | | GPT-4V(ision) (Playground) | 56.8 | **55.7** | | Reka Core* | 56.3 | - | | Gemini 1.5 Flash* | 56.1 | - | | SenseChat-Vision-0423-Preview* | 54.6 | 50.3 | | Reka Flash* | 53.3 | - | | Claude 3 Sonnet* | 53.1 | - | | HPT Pro* | 52.0 | - | | VILA1.5* | 51.9 | 46.9 | | Qwen-VL-MAX* | 51.4 | 46.8 | | InternVL-Chat-V1.2* | 51.6 | 46.2 | | Skywork-VL* | 51.4 | 46.2 | | LLaVA-1.6-34B* | 51.1 | 44.7 | | Claude 3 Haiku* | 50.2 | - | | Adept Fuyu-Heavy* | 48.3 | - | | Gemini 1.0 Pro* | 47.9 | - | | Marco-VL-Plus* | 46.2 | 44.3 | | Yi-VL-34B* | 45.9 | 41.6 | | Qwen-VL-PLUS* | 45.2 | 40.8 | | HPT Air* | 44.0 | - | | Reka Edge* | 42.8 | - | | Marco-VL* | 41.2 | 40.4 | | OmniLMM-12B* | 41.1 | 40.4 | | Bunny-8B* | 43.3 | 39.0 | | Bunny-4B* | 41.4 | 38.4 | | Weitu-VL-1.0-15B* | - | 38.4 | | InternLM-XComposer2-VL* | 43.0 | 38.2 | | Yi-VL-6B* | 39.1 | 37.8 | | InfiMM-Zephyr-7B* | 39.4 | 35.5 | | InternVL-Chat-V1.1* | 39.1 | 35.3 | | Math-LLaVA-13B* | 38.3 | 34.6 | | SVIT* | 38.0 | 34.1 | | MiniCPM-V* | 37.2 | 34.1 | | MiniCPM-V-2* | 37.1 | - | | Emu2-Chat* | 36.3 | 34.1 | | BLIP-2 FLAN-T5-XXL | 35.4 | 34.0 | | InstructBLIP-T5-XXL | 35.7 | 33.8 | | LLaVA-1.5-13B | 36.4 | 33.6 | | Bunny-3B* | 38.2 | 33.0 | | Qwen-VL-7B-Chat | 35.9 | 32.9 | | SPHINX* | 32.9 | 32.9 | | mPLUG-OWL2* | 32.7 | 32.1 | | BLIP-2 FLAN-T5-XL | 34.4 | 31.0 | | InstructBLIP-T5-XL | 32.9 | 30.6 | | Gemini Nano2* | 32.6 | - | | CogVLM | 32.1 | 30.1 | | Otter | 32.2 | 29.1 | | LLaMA-Adapter2-7B | 29.8 | 27.7 | | MiniGPT4-Vicuna-13B | 26.8 | 27.6 | | Adept Fuyu-8B | 27.9 | 27.4 | | Kosmos2 | 24.4 | 26.6 | | OpenFlamingo2-9B | 28.7 | 26.3 | | Frequent Choice | 22.1 | 23.9 | | Random Choice | 26.8 | 25.8 | *: results provided by the authors. ## Limitations Despite its comprehensive nature, MMMU, like any benchmark, is not without limitations. The manual curation process, albeit thorough, may carry biases. And the focus on college-level subjects might not fully be a sufficient test for Expert AGI. However, we believe it should be necessary for an Expert AGI to achieve strong performance on MMMU to demonstrate their broad and deep subject knowledge as well as expert-level understanding and reasoning capabilities. In future work, we plan to incorporate human evaluations into MMMU. This will provide a more grounded comparison between model capabilities and expert performance, shedding light on the proximity of current AI systems to achieving Expert AGI. ## Disclaimers The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to notify us. Upon verification, such samples will be promptly removed. ## Contact - Xiang Yue: xiangyue.work@gmail.com - Yu Su: su.809@osu.edu - Wenhu Chen: wenhuchen@uwaterloo.ca ## Citation **BibTeX:** ```bibtex @inproceedings{yue2023mmmu, title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI}, author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen}, booktitle={Proceedings of CVPR}, year={2024}, } ```

# MMMU（面向专家通用人工智能(AGI)的大规模多学科多模态理解与推理基准） [**🌐 主页**](https://mmmu-benchmark.github.io/) | [**🏆 排行榜**](https://mmmu-benchmark.github.io/#leaderboard) | [**🤗 数据集**](https://huggingface.co/datasets/MMMU/MMMU/) | [**🤗 论文**](https://huggingface.co/papers/2311.16502) | [**📖 arXiv**](https://arxiv.org/abs/2311.16502) | [**GitHub**](https://github.com/MMMU-Benchmark/MMMU) ## 🔔 最新动态 - **🛠️[2024-05-30]：修复了Materials数据集条目中的重复选项问题（validation_Materials_25；test_Materials_17、242）以及validation_Materials_25中的内容错误。** - **🛠️[2024-04-30]：修复了Math数据集条目中缺失的“-”或“^”符号（dev_Math_2、validation_Math_11、12、16；test_Math_8、23、43、113、164、223、236、287、329、402、498），并修正了validation_Math_2中的选项错误。若您遇到数据集相关问题，请及时联系我们！** - **🚀[2024-01-31]：我们已在[排行榜](https://mmmu-benchmark.github.io/#leaderboard)中新增人类专家表现数据！🌟** - **🔥[2023-12-04]：我们的测试集评估服务器现已在[EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview)上线，欢迎所有开发者提交结果，期待您的参与！😆** ## 数据集详情 ### 数据集概述我们推出MMMU：一款全新的基准测试集，用于评估多模态模型在需要大学水平学科知识与严谨推理的大规模多学科任务上的性能。MMMU包含**11.5万个精心收集的多模态问题**，数据来源于大学考试、测验与教科书，覆盖六大核心学科：艺术与设计、商科、理科、健康与医学、人文与社会科学、技术与工程。这些问题涵盖**30个学科**与**183个子领域**，包含**30种高度异质的图像类型**，例如图表、示意图、地图、表格、乐谱与化学结构式。我们相信MMMU将推动社区构建面向专家通用人工智能(AGI)的下一代多模态基础模型。 🎯 **我们已发布完整子集，包含150个开发集样本与900个验证集样本，同时发布了10500道无答案的测试题。** 开发集用于少样本/上下文学习，验证集用于模型调试、超参数选择或快速评估。测试集问题的答案与解析未公开，您可在**[EvalAI](https://eval.ai/web/challenges/challenge-page/2179/overview)** 上提交模型针对**测试集**的预测结果。 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/2Ulh9yznm1dvISV4xJ_Ok.png) ### 数据集构建 MMMU旨在通过要求大学水平学科知识与严谨推理的任务来挑战多模态模型，推动这些模型在专家级感知与推理能力上的边界拓展。MMMU数据集的数据由来自不同学科的大学生团队通过在线资源、教科书与讲义材料手动收集。 - **内容范畴**：该数据集包含六大宽泛学科（艺术与设计、商科、理科、健康与医学、人文与社会科学、技术与工程）下的11.5K个大学水平问题，涵盖30个大学学科。 - **图像类型**：数据集包含30种高度异质的图像类型，例如图表、示意图、地图、表格、乐谱与化学结构式，与文本交错呈现。 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6230d750d93e84e233882dbc/Mbf8O5lEH8I8czprch0AG.png) ## 🏆 迷你排行榜此处展示迷你排行榜，更多信息请参阅我们的论文或[**主页**](https://mmmu-benchmark.github.io/)。 | 模型名称 | 验证集（900样本） | 测试集（10.5K样本） | |--------------------------------|:---------:|:------------:| | 专家（最优表现） | 88.6 | - | | 专家（中等表现） | 82.6 | - | | 专家（最差表现） | 76.2 | - | | GPT-4o* | **69.1** | - | | Gemini 1.5 Pro* | 62.2 | - | | InternVL2-Pro* | 62.0 | **55.7** | | Gemini 1.0 Ultra* | 59.4 | - | | Claude 3 Opus* | 59.4 | - | | GPT-4V(ision)（Playground） | 56.8 | **55.7** | | Reka Core* | 56.3 | - | | Gemini 1.5 Flash* | 56.1 | - | | SenseChat-Vision-0423-Preview* | 54.6 | 50.3 | | Reka Flash* | 53.3 | - | | Claude 3 Sonnet* | 53.1 | - | | HPT Pro* | 52.0 | - | | VILA1.5* | 51.9 | 46.9 | | Qwen-VL-MAX* | 51.4 | 46.8 | | InternVL-Chat-V1.2* | 51.6 | 46.2 | | Skywork-VL* | 51.4 | 46.2 | | LLaVA-1.6-34B* | 51.1 | 44.7 | | Claude 3 Haiku* | 50.2 | - | | Adept Fuyu-Heavy* | 48.3 | - | | Gemini 1.0 Pro* | 47.9 | - | | Marco-VL-Plus* | 46.2 | 44.3 | | Yi-VL-34B* | 45.9 | 41.6 | | Qwen-VL-PLUS* | 45.2 | 40.8 | | HPT Air* | 44.0 | - | | Reka Edge* | 42.8 | - | | Marco-VL* | 41.2 | 40.4 | | OmniLMM-12B* | 41.1 | 40.4 | | Bunny-8B* | 43.3 | 39.0 | | Bunny-4B* | 41.4 | 38.4 | | Weitu-VL-1.0-15B* | - | 38.4 | | InternLM-XComposer2-VL* | 43.0 | 38.2 | | Yi-VL-6B* | 39.1 | 37.8 | | InfiMM-Zephyr-7B* | 39.4 | 35.5 | | InternVL-Chat-V1.1* | 39.1 | 35.3 | | Math-LLaVA-13B* | 38.3 | 34.6 | | SVIT* | 38.0 | 34.1 | | MiniCPM-V* | 37.2 | 34.1 | | MiniCPM-V-2* | 37.1 | - | | Emu2-Chat* | 36.3 | 34.1 | | BLIP-2 FLAN-T5-XXL | 35.4 | 34.0 | | InstructBLIP-T5-XXL | 35.7 | 33.8 | | LLaVA-1.5-13B | 36.4 | 33.6 | | Bunny-3B* | 38.2 | 33.0 | | Qwen-VL-7B-Chat | 35.9 | 32.9 | | SPHINX* | 32.9 | 32.9 | | mPLUG-OWL2* | 32.7 | 32.1 | | BLIP-2 FLAN-T5-XL | 34.4 | 31.0 | | InstructBLIP-T5-XL | 32.9 | 30.6 | | Gemini Nano2* | 32.6 | - | | CogVLM | 32.1 | 30.1 | | Otter | 32.2 | 29.1 | | LLaMA-Adapter2-7B | 29.8 | 27.7 | | MiniGPT4-Vicuna-13B | 26.8 | 27.6 | | Adept Fuyu-8B | 27.9 | 27.4 | | Kosmos2 | 24.4 | 26.6 | | OpenFlamingo2-9B | 28.7 | 26.3 | | Frequent Choice | 22.1 | 23.9 | | Random Choice | 26.8 | 25.8 | *：结果由作者提供。 ## 局限性说明尽管具备全面性，但MMMU与其他基准测试一样，并非毫无局限。尽管人工筛选过程细致严谨，但仍可能引入偏差。且其聚焦大学水平学科的设定，或许不足以完全验证专家通用人工智能的能力。但我们认为，若要展现广泛且深入的学科知识，以及专家级的理解与推理能力，专家通用人工智能在MMMU上取得优异性能是必要前提。在未来工作中，我们计划将人工评估纳入MMMU，这将为模型能力与专家表现提供更贴合实际的对比，阐明当前人工智能系统距离实现专家通用人工智能的差距。 ## 免责声明标注者指南强调需严格遵守原始数据源的版权与许可规则，特别避免使用来自禁止复制与再分发的网站的材料。若您发现任何可能违反任何网站版权或许可规定的数据样本，欢迎通知我们。经核实后，此类样本将被立即移除。 ## 联系方式 - 岳翔：xiangyue.work@gmail.com - 苏煜：su.809@osu.edu - 陈文虎：wenhuchen@uwaterloo.ca ## 引用格式 **BibTeX：** bibtex @inproceedings{yue2023mmmu, title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI}, author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2024}, }

应用场景：