
MMStar

ModelScope Community, updated 2026-05-16, indexed 2025-09-20
Download link:
https://modelscope.cn/datasets/evalscope/MMStar
Resource description:
# MMStar (Are We on the Right Way for Evaluating Large Vision-Language Models?)

[**🌐 Homepage**](https://mmstar-benchmark.github.io/) | [**🤗 Dataset**](https://huggingface.co/datasets/Lin-Chen/MMStar) | [**🤗 Paper**](https://huggingface.co/papers/2403.20330) | [**📖 arXiv**](https://arxiv.org/pdf/2403.20330.pdf) | [**GitHub**](https://github.com/MMStar-Benchmark/MMStar)

## Dataset Details

As shown in the figure below, existing benchmarks overlook two issues: whether evaluation samples actually depend on the visual input, and whether those samples have leaked into the training data of LLMs and LVLMs.

<p align="center">
  <img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/4_case_in_1.png" width="80%"> <br>
</p>

Therefore, we introduce MMStar: an elite vision-indispensable multi-modal benchmark that aims to ensure each curated sample exhibits **visual dependency** and **minimal data leakage**, and **requires advanced multi-modal capabilities**.

🎯 **We have released a full set comprising 1,500 samples for offline evaluation.** After applying the coarse filter process and manual review, we narrow down from a total of 22,401 samples to 11,607 candidate samples and finally select 1,500 high-quality samples to construct our MMStar benchmark.

<p align="center">
  <img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/data_source.png" width="80%"> <br>
</p>

In MMStar, we display **6 core capabilities** in the inner ring, with **18 detailed axes** presented in the outer ring. The middle ring shows the number of samples for each detailed axis. Each core capability contains a meticulously **balanced 250 samples**. We further ensure a relatively even distribution across the 18 detailed axes.
<p align="center">
  <img src="https://raw.githubusercontent.com/MMStar-Benchmark/MMStar/main/resources/mmstar.png" width="60%"> <br>
</p>

## 🏆 Mini-Leaderboard

We show a mini-leaderboard here; more information is available in our paper or on the [homepage](https://mmstar-benchmark.github.io/). MG is multi-modal gain (higher is better) and ML is multi-modal leakage (lower is better).

| Model               |   Acc.   |   MG ⬆   |  ML ⬇   |
|---------------------|:--------:|:--------:|:-------:|
| GPT4V (high)        | **57.1** | **43.6** |   1.3   |
| InternLM-Xcomposer2 |   55.4   |   28.1   |   7.5   |
| LLaVA-Next-34B      |   52.1   |   29.4   |   2.4   |
| GPT4V (low)         |   46.1   |   32.6   |   1.3   |
| InternVL-Chat-v1.2  |   43.7   |   32.6   | **0.0** |
| GeminiPro-Vision    |   42.6   |   27.4   | **0.0** |
| Sphinx-X-MoE        |   38.9   |   14.8   |   1.0   |
| Monkey-Chat         |   38.3   |   13.5   |  17.6   |
| Yi-VL-6B            |   37.9   |   15.6   | **0.0** |
| Qwen-VL-Chat        |   37.5   |   23.9   | **0.0** |
| Deepseek-VL-7B      |   37.1   |   15.7   | **0.0** |
| CogVLM-Chat         |   36.5   |   14.9   | **0.0** |
| Yi-VL-34B           |   36.1   |   18.8   | **0.0** |
| TinyLLaVA           |   36.0   |   16.4   |   7.6   |
| ShareGPT4V-7B       |   33.0   |   11.9   | **0.0** |
| LLaVA-1.5-13B       |   32.8   |   13.9   | **0.0** |
| LLaVA-1.5-7B        |   30.3   |   10.7   | **0.0** |
| Random Choice       |   24.6   |    -     |    -    |

## 📧 Contact

- [Lin Chen](https://lin-chen.site/): chlin@mail.ustc.edu.cn
- [Jinsong Li](https://li-jinsong.github.io/): lijingsong@pjlab.org.cn

## ✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝

```bibtex
@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}
```
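The MG and ML columns of the mini-leaderboard are derived from three accuracy scores. The following is a minimal sketch, assuming the definitions given in the MMStar paper: MG = S_v - S_wv and ML = max(0, S_wv - S_t), where S_v is the LVLM's score with images, S_wv is its score on the same questions without images, and S_t is the text-only score of its base LLM. The function names and the example numbers are hypothetical and are not taken from the leaderboard; please check the paper before relying on these formulas.

```python
def multimodal_gain(score_with_vision: float, score_without_vision: float) -> float:
    """MG: how much the LVLM's accuracy improves when it can see the image.

    Assumed definition: MG = S_v - S_wv.
    """
    return score_with_vision - score_without_vision


def multimodal_leakage(score_without_vision: float, base_llm_score: float) -> float:
    """ML: accuracy the LVLM reaches without images beyond what its base LLM reaches.

    Assumed definition: ML = max(0, S_wv - S_t). A positive value suggests the
    samples may have been seen during multi-modal training.
    """
    return max(0.0, score_without_vision - base_llm_score)


if __name__ == "__main__":
    # Hypothetical accuracies (in percent) for an illustrative model.
    s_v, s_wv, s_t = 50.0, 30.0, 25.0
    print("MG:", multimodal_gain(s_v, s_wv))
    print("ML:", multimodal_leakage(s_wv, s_t))
```

Under these assumptions, a model whose image-free score does not exceed its base LLM's score gets ML = 0.0, which is how the bolded **0.0** entries in the table would be read.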

Provider: maas
Created: 2025-09-12