MMStar

ModelScope community · Updated 2026-05-16 · Indexed 2025-09-20

Download link:

https://modelscope.cn/datasets/evalscope/MMStar


Resource description:

# MMStar (Are We on the Right Way for Evaluating Large Vision-Language Models?)

[**🌐 Homepage**](https://mmstar-benchmark.github.io/) | [**🤗 Dataset**](https://huggingface.co/datasets/Lin-Chen/MMStar) | [**🤗 Paper**](https://huggingface.co/papers/2403.20330) | [**📖 arXiv**](https://arxiv.org/pdf/2403.20330.pdf) | [**GitHub**](https://github.com/MMStar-Benchmark/MMStar)

## Dataset Details

As shown in the figure below, existing benchmarks do not account for the vision dependency of evaluation samples or for potential data leakage from the training data of LLMs and LVLMs.

<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/4_case_in_1.png" width="80%">

Therefore, we introduce MMStar: an elite vision-indispensable multi-modal benchmark in which each curated sample exhibits **visual dependency** and **minimal data leakage**, and **requires advanced multi-modal capabilities**.

🎯 **We have released the full set of 1,500 samples for offline evaluation.** After a coarse filtering process and manual review, we narrow a total of 22,401 samples down to 11,607 candidate samples and finally select 1,500 high-quality samples to construct the MMStar benchmark.

<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/data_source.png" width="80%">

In MMStar, the inner ring displays **6 core capabilities** and the outer ring presents **18 detailed axes**. The middle ring shows the number of samples for each detailed axis. Each core capability contains a meticulously **balanced 250 samples** (6 × 250 = 1,500 in total). We further ensure a relatively even distribution across the 18 detailed axes.

<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/mmstar.png" width="60%">

## 🏆 Mini-Leaderboard

We show a mini-leaderboard here; more information is available in our paper and on the [homepage](https://mmstar-benchmark.github.io/). Higher is better for MG (⬆) and lower is better for ML (⬇).

| Model | Acc. | MG ⬆ | ML ⬇ |
|----------------------------|:---------:|:------------:|:------------:|
| GPT4V (high) | **57.1** | **43.6** | 1.3 |
| InternLM-Xcomposer2 | 55.4 | 28.1 | 7.5 |
| LLaVA-Next-34B | 52.1 | 29.4 | 2.4 |
| GPT4V (low) | 46.1 | 32.6 | 1.3 |
| InternVL-Chat-v1.2 | 43.7 | 32.6 | **0.0** |
| GeminiPro-Vision | 42.6 | 27.4 | **0.0** |
| Sphinx-X-MoE | 38.9 | 14.8 | 1.0 |
| Monkey-Chat | 38.3 | 13.5 | 17.6 |
| Yi-VL-6B | 37.9 | 15.6 | **0.0** |
| Qwen-VL-Chat | 37.5 | 23.9 | **0.0** |
| Deepseek-VL-7B | 37.1 | 15.7 | **0.0** |
| CogVLM-Chat | 36.5 | 14.9 | **0.0** |
| Yi-VL-34B | 36.1 | 18.8 | **0.0** |
| TinyLLaVA | 36.0 | 16.4 | 7.6 |
| ShareGPT4V-7B | 33.0 | 11.9 | **0.0** |
| LLaVA-1.5-13B | 32.8 | 13.9 | **0.0** |
| LLaVA-1.5-7B | 30.3 | 10.7 | **0.0** |
| Random Choice | 24.6 | - | - |

## 📧 Contact

- [Lin Chen](https://lin-chen.site/): chlin@mail.ustc.edu.cn
- [Jinsong Li](https://li-jinsong.github.io/): lijingsong@pjlab.org.cn

## ✒️ Citation

If you find our work helpful for your research, please consider giving it a star ⭐ and citing it 📝

```bibtex
@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}
```
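The MG and ML columns in the mini-leaderboard are not defined on this page. As a minimal sketch, assuming the definitions given in the MMStar paper as we recall them (multi-modal gain is the score with images minus the score without images; multi-modal leakage is the score without images minus the base LLM's text-only score, floored at zero), the two metrics could be computed as follows. The function names and the example scores are hypothetical and are not taken from the leaderboard; check the paper before relying on these formulas.

```python
def multimodal_gain(score_with_images: float, score_without_images: float) -> float:
    """Assumed MG: how much the LVLM improves when it can actually see the image."""
    return score_with_images - score_without_images


def multimodal_leakage(score_without_images: float, llm_base_score: float) -> float:
    """Assumed ML: how far the image-free LVLM exceeds its base LLM, floored at 0."""
    return max(0.0, score_without_images - llm_base_score)


# Hypothetical scores (percent accuracy), chosen only for illustration.
s_v = 50.0   # LVLM evaluated with images
s_wv = 30.0  # same LVLM evaluated without images
s_t = 25.0   # base LLM evaluated without images

mg = multimodal_gain(s_v, s_wv)
ml = multimodal_leakage(s_wv, s_t)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")
```

Under these assumed definitions, a high MG indicates that the model draws on the visual input, while an ML of 0.0 indicates that the image-free LVLM does no better than its base LLM.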


Provider:

maas

Created:

2025-09-12
