MathVerse
Source: 魔搭社区 (ModelScope) · Updated: 2026-05-06 · Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/evalscope/MathVerse
Resource overview:
# Dataset Card for MathVerse
- [Dataset Description](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#dataset-description)
- [Paper Information](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#paper-information)
- [Dataset Examples](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#dataset-examples)
- [Leaderboard](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#leaderboard)
- [Citation](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#citation)
## Dataset Description
The capabilities of **Multi-modal Large Language Models (MLLMs)** in **visual math problem-solving** remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within their textual questions, which may help MLLMs deduce answers without truly interpreting the input diagrams.
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig1.png" width="90%"> <br>
</p>
To this end, we introduce **MathVerse**, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into **six distinct versions**, each offering varying degrees of information content in multi-modality, contributing to **15K** test samples in total. This approach allows MathVerse to comprehensively assess ***whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.***
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig2.png" width="90%"> <br>
Six different versions of each problem in <b>MathVerse</b> transformed by expert annotators.
</p>
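The sample count quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the six version names are an assumption based on the naming used in the MathVerse paper and are not listed in this card, and the total assumes every problem has all six versions.

```python
# Number of source problems, as stated in the dataset description.
PROBLEM_COUNT = 2612

# Assumed version names (not given in this card); only the count of six matters here.
VERSIONS = [
    "Text Dominant",
    "Text Lite",
    "Text Only",
    "Vision Intensive",
    "Vision Dominant",
    "Vision Only",
]


def total_samples(problem_count: int, versions: list) -> int:
    """Return the sample count if every problem appears once per version."""
    return problem_count * len(versions)


print(total_samples(PROBLEM_COUNT, VERSIONS))  # 15672, i.e. roughly 15K
```

Under these assumptions the product is 15,672, which is consistent with the "15K" figure in the description.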
In addition, we propose a **Chain-of-Thought (CoT) Evaluation strategy** for a fine-grained assessment of the output answers. Rather than naively judging answers as True or False, we employ GPT-4(V) to adaptively extract the crucial reasoning steps and then score each step with detailed error analysis, revealing the quality of the MLLMs' intermediate CoT reasoning.
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig3.png" width="90%"> <br>
The two phases of the CoT evaluation strategy.
</p>
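The two phases described above (step extraction, then per-step scoring) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `cot_score`, `StepJudgement`, the toy judge functions, and the weighting parameter `alpha` are all hypothetical names introduced here, and in practice both judge callables would be backed by GPT-4(V) prompts rather than string rules.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepJudgement:
    step: str        # one extracted reasoning step
    correct: bool    # the judge's verdict for this step
    note: str = ""   # optional error analysis from the judge


def cot_score(
    model_output: str,
    extract_steps: Callable[[str], List[str]],
    judge_step: Callable[[str], StepJudgement],
    final_answer_correct: bool,
    alpha: float = 0.7,  # assumed weight on step quality, not taken from this card
) -> float:
    """Phase 1: extract key steps. Phase 2: score each step, then combine."""
    steps = extract_steps(model_output)
    judgements = [judge_step(s) for s in steps]
    step_score = (
        sum(j.correct for j in judgements) / len(judgements) if judgements else 0.0
    )
    # Blend the average step score with the final-answer correctness.
    return alpha * step_score + (1.0 - alpha) * float(final_answer_correct)


# Toy stand-ins for the GPT-4(V) judge, for illustration only.
def split_lines(text: str) -> List[str]:
    return [line.strip() for line in text.splitlines() if line.strip()]


def flag_marked_errors(step: str) -> StepJudgement:
    wrong = "[wrong]" in step
    return StepJudgement(step, not wrong, "marked as wrong" if wrong else "")


demo_output = """Angle A is 30 degrees.
Angle B is 70 degrees. [wrong]
The angles of a triangle sum to 180 degrees.
So angle C is 80 degrees."""

score = cot_score(
    demo_output, split_lines, flag_marked_errors, final_answer_correct=False
)
print(round(score, 3))
```

Passing the extractor and the step judge as callables keeps the scoring logic independent of whichever model serves as the judge, so the toy functions can be swapped for real API calls without changing `cot_score`.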
## Paper Information
- Code: https://github.com/ZrrSkywalker/MathVerse
- Project: https://mathverse-cuhk.github.io/
- Visualization: https://mathverse-cuhk.github.io/#visualization
- Leaderboard: https://mathverse-cuhk.github.io/#leaderboard
- Paper: https://arxiv.org/abs/2403.14624
## Dataset Examples
🖱 Click to expand the examples of the six problem versions across three subjects
<details>
<summary>🔍 Plane Geometry</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver1.png" width="50%"> <br>
</p>
</details>
<details>
<summary>🔍 Solid Geometry</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver2.png" width="50%"> <br>
</p>
</details>
<details>
<summary>🔍 Functions</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver3.png" width="50%"> <br>
</p>
</details>
## Leaderboard
### Contributing to the Leaderboard
🚨 The [Leaderboard](https://mathverse-cuhk.github.io/#leaderboard) is continuously being updated.
The evaluation instructions and tools will be released soon. Until then, please email your results on the ***testmini*** set to 1700012927@pku.edu.cn, using the following template to prepare your result JSON file.
- [output_testmini_template.json]()
## License
This project is released under the MIT license.
## Citation
If you find **MathVerse** useful for your research and applications, please kindly cite using this BibTeX:
```latex
@inproceedings{zhang2024mathverse,
title={MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?},
author={Renrui Zhang and Dongzhi Jiang and Yichi Zhang and Haokun Lin and Ziyu Guo and Pengshuo Qiu and Aojun Zhou and Pan Lu and Kai-Wei Chang and Peng Gao and Hongsheng Li},
booktitle={arXiv},
year={2024}
}
```
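Provider:
maas
Created:
2025-10-13
Collected summary
Dataset introduction

Background and challenges
Background overview
MathVerse is a multi-modal visual math benchmark dataset designed to evaluate how well multi-modal large language models truly understand diagrams when solving visual math problems. The dataset contains 2,612 high-quality math problems, each transformed into six different versions, for a total of about 15K test samples that vary in how much information each modality carries. In addition, the dataset adopts a Chain-of-Thought evaluation strategy, using GPT-4(V) to extract and score reasoning steps for fine-grained performance analysis.
The content above was collected and summarized by 遇见数据集.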



