MMStar
Source: ModelScope community · Updated 2026-05-16 · Indexed 2025-09-20
Download link: https://modelscope.cn/datasets/evalscope/MMStar
# MMStar (Are We on the Right Way for Evaluating Large Vision-Language Models?)
[**🌐 Homepage**](https://mmstar-benchmark.github.io/) | [**🤗 Dataset**](https://huggingface.co/datasets/Lin-Chen/MMStar) | [**🤗 Paper**](https://huggingface.co/papers/2403.20330) | [**📖 arXiv**](https://arxiv.org/pdf/2403.20330.pdf) | [**GitHub**](https://github.com/MMStar-Benchmark/MMStar)
## Dataset Details
As shown in the figure below, existing benchmarks overlook two issues: whether evaluation samples actually depend on the visual input, and whether those samples may have leaked into the training data of LLMs and LVLMs.
<p align="center">
<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/4_case_in_1.png" width="80%"> <br>
</p>
Therefore, we introduce MMStar: an elite vision-indispensable multi-modal benchmark in which every curated sample is intended to exhibit **visual dependency** and **minimal data leakage** and to **require advanced multi-modal capabilities**.
🎯 **We have released the full set of 1,500 samples for offline evaluation.** After a coarse filtering pass and manual review, we narrowed an initial pool of 22,401 samples down to 11,607 candidates and finally selected 1,500 high-quality samples to construct the MMStar benchmark.
<p align="center">
<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/data_source.png" width="80%"> <br>
</p>
In MMStar, the inner ring shows the **6 core capabilities** and the outer ring shows the **18 detailed axes**; the middle ring gives the number of samples for each detailed axis. Each core capability contains a **balanced 250 samples**, and we further ensure a relatively even distribution across the 18 detailed axes.
<p align="center">
<img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/mmstar.png" width="60%"> <br>
</p>
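The balance described above (6 core capabilities × 250 samples = 1,500) can be checked with a simple count once the samples are loaded. The following is a minimal sketch, not an official utility: the helper `check_balance` and the field name `category` are assumptions made for illustration, and the toy records stand in for the real dataset.

```python
from collections import Counter


def check_balance(samples, key="category", expected_per_group=250):
    """Count samples per group and report groups that deviate from the target.

    `key` is an assumed field name; adjust it to match the actual schema.
    """
    counts = Counter(sample[key] for sample in samples)
    unbalanced = {
        group: n for group, n in counts.items() if n != expected_per_group
    }
    return counts, unbalanced


# Synthetic stand-in: 1,500 records spread evenly over 6 placeholder capabilities.
toy_samples = [{"category": f"capability_{i % 6}"} for i in range(1500)]

counts, unbalanced = check_balance(toy_samples)
print(dict(counts))
print("unbalanced groups:", unbalanced)
```

The same function can be pointed at a finer-grained field (for example, one holding the 18 detailed axes) with a different `expected_per_group` to inspect how even that distribution is.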
## 🏆 Mini-Leaderboard
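A mini-leaderboard is shown below; see our paper or [homepage](https://mmstar-benchmark.github.io/) for more information. Acc. is accuracy (%), MG ⬆ is multi-modal gain (higher is better), and ML ⬇ is multi-modal leakage (lower is better).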
| Model               |   Acc.   |   MG ⬆   |  ML ⬇   |
|---------------------|:--------:|:--------:|:-------:|
| GPT4V (high)        | **57.1** | **43.6** |   1.3   |
| InternLM-Xcomposer2 |   55.4   |   28.1   |   7.5   |
| LLaVA-Next-34B      |   52.1   |   29.4   |   2.4   |
| GPT4V (low)         |   46.1   |   32.6   |   1.3   |
| InternVL-Chat-v1.2  |   43.7   |   32.6   | **0.0** |
| GeminiPro-Vision    |   42.6   |   27.4   | **0.0** |
| Sphinx-X-MoE        |   38.9   |   14.8   |   1.0   |
| Monkey-Chat         |   38.3   |   13.5   |  17.6   |
| Yi-VL-6B            |   37.9   |   15.6   | **0.0** |
| Qwen-VL-Chat        |   37.5   |   23.9   | **0.0** |
| Deepseek-VL-7B      |   37.1   |   15.7   | **0.0** |
| CogVLM-Chat         |   36.5   |   14.9   | **0.0** |
| Yi-VL-34B           |   36.1   |   18.8   | **0.0** |
| TinyLLaVA           |   36.0   |   16.4   |   7.6   |
| ShareGPT4V-7B       |   33.0   |   11.9   | **0.0** |
| LLaVA-1.5-13B       |   32.8   |   13.9   | **0.0** |
| LLaVA-1.5-7B        |   30.3   |   10.7   | **0.0** |
| Random Choice       |   24.6   |    -     |    -    |
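
The MG and ML columns are not defined on this page. As we understand the accompanying paper, MG (multi-modal gain) is the score of the LVLM with images minus its score without images, and ML (multi-modal leakage) is the amount by which the LVLM's image-free score exceeds that of its base LLM, floored at zero. The sketch below assumes those definitions; the function names and the input numbers are hypothetical and are not taken from the leaderboard.

```python
def multimodal_gain(score_with_vision: float, score_without_vision: float) -> float:
    """MG: how much the LVLM's score improves when it can see the image."""
    return score_with_vision - score_without_vision


def multimodal_leakage(score_without_vision: float, score_llm_base: float) -> float:
    """ML: how far the image-free LVLM score exceeds its base LLM, floored at 0.

    A positive value suggests the evaluation samples may have been seen
    during multi-modal training.
    """
    return max(0.0, score_without_vision - score_llm_base)


# Hypothetical scores (percent accuracy), for illustration only.
s_v, s_wv, s_t = 50.0, 20.0, 18.5

mg = multimodal_gain(s_v, s_wv)
ml = multimodal_leakage(s_wv, s_t)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")
```

Under these assumed definitions, a model with high accuracy but low MG is answering mostly without looking at the image, and a non-zero ML flags possible contamination.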
## 📧 Contact
- [Lin Chen](https://lin-chen.site/): chlin@mail.ustc.edu.cn
- [Jinsong Li](https://li-jinsong.github.io/): lijingsong@pjlab.org.cn
## ✒️ Citation
If you find our work helpful for your research, please consider starring the repository ⭐ and citing the paper 📝
```bibtex
@article{chen2024we,
title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
journal={arXiv preprint arXiv:2403.20330},
year={2024}
}
```
Provider: maas
Created: 2025-09-12



