
InternScience/MME-Reasoning

Hugging Face | Updated 2025-06-13 | Indexed 2026-02-07
Download link:
https://hf-mirror.com/datasets/InternScience/MME-Reasoning
Resource description:
---
task_categories:
- visual-question-answering
language:
- en
pretty_name: i
size_categories:
- 1K<n<10K
---

# MME-Reasoning 🔥: A Comprehensive Benchmark for Logical Reasoning in MLLMs

![Multimodal Reasoning](https://img.shields.io/badge/Task-Multimodal_Reasoning-red)
![Visual Reasoning](https://img.shields.io/badge/Task-Visual_Reasoning-red)
![MME-Reasoning](https://img.shields.io/badge/Dataset-MME--Reasoning-blue)
![OpenAI o4-mini](https://img.shields.io/badge/Model-OpenAI_o4--mini-green)
![Seed1.5-VL-Thinking](https://img.shields.io/badge/Model-Seed1.5--VL--Thinking-green)
![Gemini2.5-Pro-Thinking](https://img.shields.io/badge/Model-Gemini2.5--Pro--Thinking-green)

Official repository for "[MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs](https://arxiv.org/pdf/2505.21327)".

🌟 For more details, please refer to the project page.

[[🚀 Project Page](https://alpha-innovator.github.io/mmereasoning.github.io/)] [[📖 Paper](https://arxiv.org/pdf/2505.21327)] [[🗃️ Github](https://github.com/Alpha-Innovator/MME-Reasoning)] [[🏆 Leaderboard](https://alpha-innovator.github.io/mmereasoning.github.io/#leaderboard)]

## 💥 News

- **[2025.05.23]** 🔥 We launch MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs. We release the [arXiv paper](https://arxiv.org/pdf/2505.21327) and all data samples as a Hugging Face dataset.

## 👀 About MME-Reasoning

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Existing benchmarks fail to comprehensively evaluate MLLMs' reasoning abilities because they lack an explicit categorization of logical reasoning types and a clear understanding of what reasoning is. In this paper, we introduce **MME-Reasoning**, a comprehensive benchmark specifically designed to evaluate the reasoning capability of MLLMs.
MME-Reasoning consists of 1,188 carefully curated questions that systematically cover three types of logical reasoning (**inductive**, **deductive**, and **abductive**) while spanning a range of difficulty levels.

<!-- <p align="center"> <img src="teaser.png" width="70%"> <br> </p> -->

Experiments were conducted on state-of-the-art MLLMs, covering both Chat and Thinking variants of open-source and closed-source models. Evaluations with MME-Reasoning reveal these key findings:

- **(1) MLLMs exhibit significant limitations and pronounced imbalances in reasoning capabilities.**
- **(2) Abductive reasoning remains a major bottleneck for current MLLMs.**
- **(3) Reasoning length scales with task difficulty, benefiting performance but accompanied by marginal effects and decreasing token efficiency.**

We hope MME-Reasoning serves as a foundation for advancing multimodal reasoning in MLLMs.

<!-- <p align="center"> <img src="performance.png" width="95%"> <br> </p> -->

## Inference

We are working to integrate MME-Reasoning into existing VLM evaluation frameworks. For the current version of the evaluation, please follow these steps:

1. Set up your environment following [VLMEvalKit](./README_VLMEVAL.md).
2. Download the MME-Reasoning data and metadata from Hugging Face.
3. Set the environment variable `LMUData` (note that the images should exist under `$LMUData/MMEReasoning/images/`).
4. Set the metadata path in `vlmeval/dataset/mmereasoning/mmereasoning.py` on lines 19 and 25.
5. Run:

   ```bash
   python run.py --data MMEReasoning --model your_model --mode infer --verbose
   ```

6. Extract and judge the final results:

   ```bash
   python test_mme_reasoning.py --file_path response_file
   ```

   The response file is located in the outputs directory and its name ends with `scores.xlsx`.

## 🏆 Leaderboard

### Contributing to the Leaderboard

🚀 The [Leaderboard](https://alpha-innovator.github.io/mmereasoning.github.io/#leaderboard) is continuously updated, and we welcome contributions of your excellent MLLMs!

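
Before submitting results, it can help to confirm the data layout from the Inference steps and to locate the response files. The following is a minimal sketch using only the Python standard library. The helper names `images_dir` and `find_score_files` are hypothetical, the directory name `outputs` and the recursive search are assumptions, and nothing here calls VLMEvalKit itself.

```python
import os
from pathlib import Path


def images_dir(lmu_root: str) -> Path:
    """Return the image directory expected under the LMUData root."""
    return Path(lmu_root) / "MMEReasoning" / "images"


def find_score_files(outputs_dir: str) -> list[Path]:
    """Recursively collect files whose names end with 'scores.xlsx'."""
    root = Path(outputs_dir)
    if not root.is_dir():
        # No outputs directory yet, so there is nothing to report.
        return []
    return sorted(p for p in root.rglob("*scores.xlsx") if p.is_file())


if __name__ == "__main__":
    # LMUData is the environment variable named in step 3.
    lmu_root = os.environ.get("LMUData", "")
    if not lmu_root or not images_dir(lmu_root).is_dir():
        print("LMUData is unset or MMEReasoning/images/ is missing")

    # "outputs" is an assumed directory name; adjust it to your run.
    for path in find_score_files("outputs"):
        print(path)
```

Each printed path is a candidate value for `--file_path` in step 6.
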
To contribute your model to the leaderboard, please email the prediction files to 📧 [jkyuan112@gmail.com](mailto:jkyuan112@gmail.com) or [pengts521@gmail.com](mailto:pengts521@gmail.com).

If you find **MME-Reasoning** useful for your research and applications, please cite it using this BibTeX entry:

```bibtex
@article{yuan2025mme,
  title={MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs},
  author={Yuan, Jiakang and Peng, Tianshuo and Jiang, Yilei and Lu, Yiting and Zhang, Renrui and Feng, Kaituo and Fu, Chaoyou and Chen, Tao and Bai, Lei and Zhang, Bo and others},
  journal={arXiv preprint arXiv:2505.21327},
  year={2025}
}
```

Provider:
InternScience