MathVision
Source: ModelScope community · Updated 2026-05-15 · Indexed 2025-11-03
Download link: https://modelscope.cn/datasets/evalscope/MathVision
Description:
# Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset
[[💻 Github](https://github.com/mathllm/MATH-V/)] [[🌐 Homepage](https://mathllm.github.io/mathvision/)] [[📊 Main Leaderboard ](https://mathllm.github.io/mathvision/#leaderboard)] [[📊 Open Source Leaderboard ](https://mathllm.github.io/mathvision/#openleaderboard)] [[🌿 Wild Leaderboard ](https://mathllm.github.io/mathvision/#wildleaderboard)] [[🔍 Visualization](https://mathllm.github.io/mathvision/#visualization)] [[📖 Paper](https://proceedings.neurips.cc/paper_files/paper/2024/file/ad0edc7d5fa1a783f063646968b7315b-Paper-Datasets_and_Benchmarks_Track.pdf)]
---
## 🌿 NEW: MATH-Vision-Wild
**MATH-Vision-Wild** is a photographic, real-world variant of MATH-Vision. The same testmini problems are **physically captured** on printed paper, iPads, laptops, and projectors under varying lighting and angles — the conditions VLMs actually face when a user holds up a phone to a math problem.
📦 **Dataset**: [MathLLMs/MathVision-Wild](https://huggingface.co/datasets/MathLLMs/MathVision-Wild) · 🏆 **Leaderboard**: [mathllm.github.io/mathvision/#wildleaderboard](https://mathllm.github.io/mathvision/#wildleaderboard)
**Key finding — almost every model *regresses* in the wild:**
| Model | MATH-Vision (testmini) | MATH-Vision-Wild | Δ (relative change) |
|---|---:|---:|---:|
| **o4-mini** 🥇 | 55.9 | **57.2** | **+2.33%** (only model to improve) |
| Gemini 2.5 Pro Preview 05-06 (thinking) | 63.8 | 49.0 | **−23.20%** |
| Gemini 2.5 Flash Preview 05-20 | 57.9 | 48.0 | −17.10% |
| Doubao-1.5-thinking-vision-pro | 57.9 | 45.7 | −21.07% |
| Gemini 2.5 Pro Preview 05-06 | 61.8 | 42.8 | −30.74% |
| GPT-4.1 | 40.5 | 35.5 | −12.35% |
| Qwen2.5-VL-72B-Instruct | 36.2 | 24.0 | −33.70% |
| Gemini 2.0 Flash | 48.0 | 23.0 | −52.08% |
| Gemini 1.5 Pro | 38.8 | 18.4 | −52.58% |
Only **o4-mini** improves when problems are photographed; long-reasoning models degrade less than fast/non-thinking models. A full 25-model comparison with MATH-Vision-Screenshot and Δ% columns is on the [Wild Leaderboard](https://mathllm.github.io/mathvision/#wildleaderboard).
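The Δ values in the table appear to be relative changes from the testmini score to the Wild score, not differences in percentage points; recomputing them from the two score columns reproduces the listed figures. A minimal sketch of that arithmetic (the helper name `relative_change` is ours and is not part of any MATH-Vision tooling):

```python
def relative_change(baseline: float, wild: float) -> float:
    """Relative change in percent from the baseline score to the Wild score."""
    return (wild - baseline) / baseline * 100.0


# Scores copied from the table above: (testmini, Wild).
scores = {
    "o4-mini": (55.9, 57.2),
    "GPT-4.1": (40.5, 35.5),
    "Gemini 1.5 Pro": (38.8, 18.4),
}

for model, (baseline, wild) in scores.items():
    print(f"{model}: {relative_change(baseline, wild):+.2f}%")
```

In absolute terms, o4-mini gains 1.3 points while Gemini 1.5 Pro loses 20.4 points, so the relative figures should not be read as percentage-point drops.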
```python
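from datasets import load_dataset

# Photographed versions of the testmini problems
wild = load_dataset("MathLLMs/MathVision-Wild", split="testmini_photo")
# Screenshot versions of the testmini problems
screenshot = load_dataset("MathLLMs/MathVision-Wild", split="testmini_screenshot")
# Photographed versions of the full test set (3,040 photos)
photo_full = load_dataset("MathLLMs/MathVision-Wild", split="test_photo")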
```
---
## 🚀 Data Usage
<!-- **We have observed that some studies have used our MATH-Vision dataset as a training set.**
⚠️ **As clearly stated in our paper: *"The MATH-V dataset is not supposed, though the risk exists, to be used to train models for cheating. We intend for researchers to use this dataset to better evaluate LMMs’ mathematical reasoning capabilities and consequently facilitate future studies in this area."***
⚠️⚠️⚠️ **In the very rare situation that there is a compelling reason to include MATH-V in your training set, we strongly urge that the ***testmini*** subset be excluded from the training process!**
-->
```python
from datasets import load_dataset
dataset = load_dataset("MathLLMs/MathVision")
print(dataset)
```
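Before running an evaluation it can help to check which splits and columns the loaded object exposes. The sketch below assumes `dataset` behaves like a Hugging Face `DatasetDict`, that is, a mapping from split name to an object with `num_rows` and `column_names`. `summarize_splits` is a hypothetical helper, and no particular split or column names are assumed.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        name: (split.num_rows, list(split.column_names))
        for name, split in dataset_dict.items()
    }


# Usage with the `dataset` object loaded above (requires network access):
# for name, (rows, cols) in summarize_splits(dataset).items():
#     print(name, rows, cols)
```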
## 🙏 Acknowledgments
We would like to thank the following contributors for helping improve the dataset quality:
- [@Zhiqi Huang](https://huggingface.co/googlebrain) for correcting the answer for ID 21
- [@Big-Brother-Pikachu](https://github.com/Big-Brother-Pikachu) for correcting answers for ID 338 and ID 1826
## 💥 News
- **[2026.04.24]** 🌿🔥🔥 **MATH-Vision-Wild released!** A photographic variant that re-captures MATH-Vision problems on printed paper, iPads, laptops, and projectors. **o4-mini** is the **only** model whose accuracy *improves* in the wild (**57.2%**, +2.33% vs baseline); every other VLM — including Gemini 2.5 Pro, GPT-4.1, and Qwen2.5-VL-72B — regresses, often by 20-50%. Dataset: [MathLLMs/MathVision-Wild](https://huggingface.co/datasets/MathLLMs/MathVision-Wild) · Leaderboard: [#wildleaderboard](https://mathllm.github.io/mathvision/#wildleaderboard)
- **[2025.05.16]** 💥 We now support the official open-source leaderboard! 🔥🔥🔥 [**Skywork-R1V2-38B**](https://github.com/SkyworkAI/Skywork-R1V) is the best open-source model, scoring **49.7%** on MATH-Vision. 🔥🔥🔥 [**MathCoder-VL-2B**](https://huggingface.co/MathLLMs/MathCoder-VL-2B) is the best small model on MATH-Vision, scoring **21.7%**. See the full [open-source leaderboard](https://mathllm.github.io/mathvision/#openleaderboard).
- **[2025.05.16]** 🤗 [MathCoder-VL-2B](https://huggingface.co/MathLLMs/MathCoder-VL-2B), [MathCoder-VL-8B](https://huggingface.co/MathLLMs/MathCoder-VL-8B) and [FigCodifier-8B](https://huggingface.co/MathLLMs/FigCodifier) are available now! 🔥🔥🔥
- **[2025.05.16]** Our MathCoder-VL has been accepted to ACL 2025. 🔥🔥🔥
- **[2025.05.13]** 🔥🔥🔥 **[Seed1.5-VL](https://github.com/ByteDance-Seed/Seed1.5-VL)** achieves **68.7%** on MATH-Vision! 🎉 Congratulations!
- **[2025.04.11]** 💥 **Kimi-VL-A3B-Thinking achieves strong multimodal reasoning with just 2.8B LLM activated parameters!** Congratulations! See the full [leaderboard](https://mathllm.github.io/mathvision/#leaderboard).
- **[2025.04.10]** 🔥 **SenseNova V6 Reasoner** achieves **55.39%** on MATH-Vision! 🎉 Congratulations!
- **[2025.04.05]** 💥 **Step R1-V-Mini 🥇 Sets New SOTA on MATH-V with 56.6%!** See the full [leaderboard](https://mathllm.github.io/mathvision/#leaderboard).
- **[2025.03.10]** 💥 **Kimi k1.6 Preview Sets New SOTA on MATH-V with 53.29%!** See the full [leaderboard](https://mathllm.github.io/mathvision/#leaderboard).
- **[2025.02.28]** 💥 **Doubao-1.5-pro Sets New SOTA on MATH-V with 48.62%!** Read more on the [Doubao blog](https://team.doubao.com/zh/special/doubao_1_5_pro).
- **[2025.01.26]** 🚀 [Qwen2.5-VL-72B](http://qwenlm.github.io/blog/qwen2.5-vl/) achieves **38.1%**, establishing itself as the best-performing open-source model. 🎉 Congratulations!
- **[2025.01.22]** 💥 **Kimi k1.5 achieves new SOTA** on MATH-Vision with **38.6%**! Learn more at the [Kimi k1.5 report](https://arxiv.org/pdf/2501.12599).
- **[2024-09-27]** **MATH-V** is accepted to the NeurIPS 2024 Datasets and Benchmarks Track! 🎉🎉🎉
- **[2024-08-29]** 🔥 Qwen2-VL-72B achieves a new open-source SOTA on MATH-Vision with 25.9%! 🎉 Congratulations! Learn more at the [Qwen2-VL blog](https://qwenlm.github.io/blog/qwen2-vl/).
- **[2024-07-19]** [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit) now supports **MATH-V**, utilizing LLMs for more accurate answer extraction!🔥
- **[2024-05-19]** OpenAI's **GPT-4o** scores **30.39%** on **MATH-V**, a considerable advancement in a short time! 💥
- **[2024-03-01]** **InternVL-Chat-V1-2-Plus** achieves **16.97%**, establishing itself as the new best-performing open-sourced model. 🎉 Congratulations!
- **[2024-02-23]** Our dataset is now accessible at [huggingface](https://huggingface.co/datasets/MathLLMs/MathVision).
- **[2024-02-22]** The top-performing model, **GPT-4V**, scores only **23.98%** on **MATH-V**, while human performance is around **70%**.
- **[2024-02-22]** Our paper is now accessible at [ArXiv Paper](https://arxiv.org/abs/2402.14804).
## 👀 Introduction
Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs.
<p align="center">
<img src="https://raw.githubusercontent.com/mathvision-cuhk/MathVision/main/assets/figures/figure1_new.png" width="66%"> The accuracies of four prominent Large Multimodal Models (LMMs), random chance, and human <br>
performance are evaluated on our proposed <b>MATH-Vision (MATH-V)</b> across 16 subjects.
</p>
<br>
Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs.
You can refer to the [project homepage](https://mathvision-cuhk.github.io/) for more details.
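A per-subject breakdown like the one in the figure can be computed from graded predictions with a small helper. This is only a sketch: it assumes each record is a dict carrying a `subject` label and a boolean `correct` flag produced by whatever answer-matching step you use. Both field names and the sample records are illustrative and are not taken from the official evaluation code.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: accuracy in percent} from graded records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["subject"]] += 1
        hits[rec["subject"]] += int(rec["correct"])
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


# Made-up graded records, for illustration only.
sample = [
    {"subject": "topology", "correct": True},
    {"subject": "topology", "correct": False},
    {"subject": "graph theory", "correct": True},
]
print(accuracy_by_subject(sample))
```

Grouping by difficulty level instead would only require swapping the grouping key, assuming the records carry such a field.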
## 🏆 Leaderboard
The leaderboard is available [here](https://mathvision-cuhk.github.io/#leaderboard).
We are committed to maintaining this dataset and leaderboard in the long run to ensure their quality!
🔔 If you find any mistakes, please post the question_id on the issue page and we will correct it accordingly.
## 📐 Dataset Examples
Some examples of MATH-V on three subjects: analytic geometry, topology, and graph theory.
<details>
<summary>Analytic geometry</summary><p align="center">
<img src="https://raw.githubusercontent.com/mathvision-cuhk/MathVision/main/assets/examples/exam_analytic_geo.png" width="60%"> <br>
</p></details>
<details>
<summary>Topology</summary><p align="center">
<img src="https://raw.githubusercontent.com/mathvision-cuhk/MathVision/main/assets/examples/exam_topology.png" width="60%"> <br>
</p></details>
<details>
<summary>Graph theory</summary><p align="center">
<img src="https://raw.githubusercontent.com/mathvision-cuhk/MathVision/main/assets/examples/exam_graph.png" width="60%"> <br>
</p></details>
## 📑 Citation
If you find this benchmark useful in your research, please consider citing it with the following BibTeX:
```
@inproceedings{
wang2024measuring,
title={Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset},
author={Ke Wang and Junting Pan and Weikang Shi and Zimu Lu and Houxing Ren and Aojun Zhou and Mingjie Zhan and Hongsheng Li},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=QWTCcxMpPA}
}
@inproceedings{
wang2025mathcodervl,
title={MathCoder-{VL}: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning},
author={Ke Wang and Junting Pan and Linda Wei and Aojun Zhou and Weikang Shi and Zimu Lu and Han Xiao and Yunqiao Yang and Houxing Ren and Mingjie Zhan and Hongsheng Li},
booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025},
url={https://openreview.net/forum?id=nuvtX1imAb}
}
```
Provider: maas
Created: 2025-10-13



