CaraJ/MathVerse-lmmseval
Source: Hugging Face. Updated 2024-04-19; indexed 2024-04-19.
Download link: https://hf-mirror.com/datasets/CaraJ/MathVerse-lmmseval
Resource description:
---
task_categories:
- multiple-choice
- question-answering
- visual-question-answering
language:
- en
size_categories:
- 1K<n<10K
configs:
- config_name: testmini
  data_files:
  - split: testmini
    path: "testmini.parquet"
- config_name: testmini_version_split
  data_files:
  - split: text_lite
    path: "testmini_text_lite.parquet"
  - split: text_dominant
    path: "testmini_text_dominant.parquet"
  - split: vision_dominant
    path: "testmini_vision_dominant.parquet"
  - split: vision_intensive
    path: "testmini_vision_intensive.parquet"
  - split: vision_only
    path: "testmini_vision_only.parquet"
- config_name: testmini_text_only
  data_files:
  - split: text_only
    path: "testmini_text_only.parquet"
dataset_info:
- config_name: testmini
  features:
  - name: sample_index
    dtype: string
  - name: problem_index
    dtype: string
  - name: problem_version
    dtype: string
  - name: question
    dtype: string
  - name: image
    dtype: image
  - name: answer
    dtype: string
  - name: question_type
    dtype: string
  - name: metadata
    struct:
    - name: split
      dtype: string
    - name: source
      dtype: string
    - name: subject
      dtype: string
    - name: subfield
      dtype: string
  - name: query_wo
    dtype: string
  - name: query_cot
    dtype: string
  - name: question_for_eval
    dtype: string
  splits:
  - name: testmini
    num_bytes: 166789963
    num_examples: 3940
- config_name: testmini_version_split
  features:
  - name: sample_index
    dtype: string
  - name: problem_index
    dtype: string
  - name: problem_version
    dtype: string
  - name: question
    dtype: string
  - name: image
    dtype: image
  - name: answer
    dtype: string
  - name: question_type
    dtype: string
  - name: metadata
    struct:
    - name: split
      dtype: string
    - name: source
      dtype: string
    - name: subject
      dtype: string
    - name: subfield
      dtype: string
  - name: query_wo
    dtype: string
  - name: query_cot
    dtype: string
  - name: question_for_eval
    dtype: string
  splits:
  - name: text_lite
    num_examples: 788
  - name: text_dominant
    num_examples: 788
  - name: vision_dominant
    num_examples: 788
  - name: vision_intensive
    num_examples: 788
  - name: vision_only
    num_examples: 788
- config_name: testmini_text_only
  features:
  - name: sample_index
    dtype: string
  - name: problem_index
    dtype: string
  - name: problem_version
    dtype: string
  - name: question
    dtype: string
  - name: image
    dtype: string
  - name: answer
    dtype: string
  - name: question_type
    dtype: string
  - name: metadata
    struct:
    - name: split
      dtype: string
    - name: source
      dtype: string
    - name: subject
      dtype: string
    - name: subfield
      dtype: string
  - name: query_wo
    dtype: string
  - name: query_cot
    dtype: string
  - name: question_for_eval
    dtype: string
  splits:
  - name: text_only
    num_bytes: 250959
    num_examples: 788
---
# Dataset Card for MathVerse
This is the version of MathVerse prepared for [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). It contains the same data as the [official dataset](https://huggingface.co/datasets/AI4Math/MathVerse?row=3).
- [Dataset Description](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#dataset-description)
- [Paper Information](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#paper-information)
- [Dataset Examples](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#dataset-examples)
- [Leaderboard](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#leaderboard)
- [Citation](https://huggingface.co/datasets/AI4Math/MathVerse/blob/main/README.md#citation)
## Dataset Description
The capabilities of **Multi-modal Large Language Models (MLLMs)** in **visual math problem-solving** remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within their textual questions, which may help MLLMs deduce answers without truly interpreting the input diagrams.
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig1.png" width="90%"> <br>
</p>
To this end, we introduce **MathVerse**, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Human annotators then transform each problem into **six distinct versions**, each offering a different degree of multi-modal information content, yielding **15K** test samples in total. This approach allows MathVerse to comprehensively assess ***whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.***
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig2.png" width="90%"> <br>
Six different versions of each problem in <b>MathVerse</b> transformed by expert annotators.
</p>
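The sample counts quoted above can be checked with a little arithmetic. The sketch below only multiplies figures that appear in this card (2,612 problems, six versions, 788 examples per testmini split); the variable names are illustrative. The result for five versions is consistent with the 3,940 examples listed for the `testmini` config, and the sixth version (text only) is shipped as a separate config.

```python
# Figures quoted in the description and in the YAML header of this card.
NUM_PROBLEMS = 2612          # problems collected for the full benchmark
NUM_VERSIONS = 6             # versions created per problem

# Full benchmark: 2,612 problems x 6 versions, i.e. roughly 15K samples.
full_benchmark = NUM_PROBLEMS * NUM_VERSIONS

TESTMINI_PER_VERSION = 788   # examples per split in the testmini configs

# The five versions listed under `testmini_version_split`.
testmini_with_diagrams = TESTMINI_PER_VERSION * 5

# All six versions, including the separate `testmini_text_only` config.
testmini_all_versions = TESTMINI_PER_VERSION * NUM_VERSIONS

print(full_benchmark, testmini_with_diagrams, testmini_all_versions)
```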
In addition, we propose a **Chain-of-Thought (CoT) Evaluation strategy** for a fine-grained assessment of the output answers. Rather than naively judging an answer as true or false, we employ GPT-4(V) to adaptively extract the crucial reasoning steps and then score each step with a detailed error analysis, which reveals the quality of the intermediate CoT reasoning of MLLMs.
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/fig3.png" width="90%"> <br>
The two phases of the CoT evaluation strategy.
</p>
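To make the idea of step-wise scoring concrete, here is a minimal sketch of how per-step scores might be combined with final-answer correctness. The weighted-average rule, the function name `cot_score`, and the `alpha` parameter are assumptions made for illustration; this card does not specify the aggregation, so consult the paper for the actual formula. The step scores themselves would come from the GPT-4(V) scoring phase, which is not reproduced here.

```python
from typing import Sequence


def cot_score(step_scores: Sequence[float], answer_correct: bool, alpha: float) -> float:
    """Blend the mean per-step score with final-answer correctness.

    `alpha` is a hypothetical weight in [0, 1] on the reasoning steps;
    the remaining weight goes to the final answer.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    answer_term = 1.0 if answer_correct else 0.0
    if not step_scores:
        # No steps were extracted: fall back to the final answer alone.
        return answer_term
    step_term = sum(step_scores) / len(step_scores)
    return alpha * step_term + (1.0 - alpha) * answer_term


# Three of four steps judged correct, final answer correct.
print(cot_score([1, 1, 0, 1], answer_correct=True, alpha=0.5))
```

A scheme of this shape gives partial credit to a response whose reasoning is mostly sound even if one step is wrong, which a plain true/false judgment cannot express.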
## Paper Information
- Code: https://github.com/ZrrSkywalker/MathVerse
- Project: https://mathverse-cuhk.github.io/
- Visualization: https://mathverse-cuhk.github.io/#visualization
- Leaderboard: https://mathverse-cuhk.github.io/#leaderboard
- Paper: https://arxiv.org/abs/2403.14624
## Dataset Examples
🖱 Click to expand the examples of the six problem versions within three subjects.
<details>
<summary>🔍 Plane Geometry</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver1.png" width="50%"> <br>
</p>
</details>
<details>
<summary>🔍 Solid Geometry</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver2.png" width="50%"> <br>
</p>
</details>
<details>
<summary>🔍 Functions</summary>
<p align="center">
<img src="https://raw.githubusercontent.com/ZrrSkywalker/MathVerse/main/figs/ver3.png" width="50%"> <br>
</p>
</details>
## Leaderboard
### Contributing to the Leaderboard
🚨 The [Leaderboard](https://mathverse-cuhk.github.io/#leaderboard) is continuously being updated.
The evaluation instructions and tools will be released soon. For now, please send your results on the ***testmini*** set to 1700012927@pku.edu.cn. Please refer to the following template when preparing your result JSON file.
- [output_testmini_template.json]()
## Citation
If you find **MathVerse** useful for your research and applications, please kindly cite using this BibTeX:
```latex
@inproceedings{zhang2024mathverse,
title={MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?},
author={Renrui Zhang and Dongzhi Jiang and Yichi Zhang and Haokun Lin and Ziyu Guo and Pengshuo Qiu and Aojun Zhou and Pan Lu and Kai-Wei Chang and Peng Gao and Hongsheng Li},
booktitle={arXiv},
year={2024}
}
```
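Provider:
CaraJ
Summary of original information
Dataset overview
Task categories
- Multiple choice
- Question answering
- Visual question answering
Language
- English
Size category
- 1K<n<10K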
Configurations
- config_name: testmini
  - Data file: testmini.parquet
  - Features:
    - sample_index: string
    - problem_index: string
    - problem_version: string
    - question: string
    - image: image
    - answer: string
    - question_type: string
    - metadata: struct (split, source, subject, subfield: all string)
    - query_wo: string
    - query_cot: string
    - question_for_eval: string
  - Split: testmini
    - num_bytes: 166789963
    - num_examples: 3940
- config_name: testmini_version_split
  - Data files:
    - testmini_text_lite.parquet
    - testmini_text_dominant.parquet
    - testmini_vision_dominant.parquet
    - testmini_vision_intensive.parquet
    - testmini_vision_only.parquet
  - Features: same as above
  - Splits:
    - text_lite: num_examples: 788
    - text_dominant: num_examples: 788
    - vision_dominant: num_examples: 788
    - vision_intensive: num_examples: 788
    - vision_only: num_examples: 788
- config_name: testmini_text_only
  - Data file: testmini_text_only.parquet
  - Features:
    - image: string
    - other features same as above
  - Split: text_only
    - num_bytes: 250959
    - num_examples: 788
Collected summary
Dataset introduction

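Construction
In mathematics education, assessing visual problem-solving ability matters for the development of multi-modal large language models. Construction of the MathVerse dataset began with the careful collection of 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Expert annotators then transformed each original problem into six versions that differ by degree in how information is distributed across modalities, from text-only to vision-intensive, for a total of roughly 15,000 test samples. This process systematically controls the balance between visual and textual information and provides a solid basis for evaluating whether models truly understand the diagrams.
Features
The core feature of the dataset is its deliberately designed modality gradient. Each math problem is transformed into six versions (text lite, text dominant, vision dominant, vision intensive, vision only, and text only), covering a broad range of multi-modal reasoning scenarios. The dataset includes metadata such as subject, subfield, and question type, and it provides standardized query formats for evaluation, including direct questions and chain-of-thought prompts. This structured design lets MathVerse probe how much multi-modal large language models actually use visual information in math problem solving, and where they fall short.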
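As an illustration of the two query formats, the sketch below picks between the `query_wo` and `query_cot` fields listed in the dataset schema. Reading `query_wo` as the direct (non-CoT) query is an assumption based on the field names; the helper `select_query` and the toy record are hypothetical, and the record's text is made up for the example.

```python
def select_query(sample: dict, use_cot: bool) -> str:
    """Return the chain-of-thought query or the direct query for one record.

    Assumes `query_cot` holds the CoT prompt and `query_wo` the prompt
    without CoT instructions, as the field names suggest.
    """
    key = "query_cot" if use_cot else "query_wo"
    return sample[key]


# Toy record with placeholder text; real records carry more fields.
toy_sample = {
    "problem_version": "vision_only",
    "query_wo": "Answer the question shown in the image directly.",
    "query_cot": "Reason step by step, then answer the question shown in the image.",
}

print(select_query(toy_sample, use_cot=True))
```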
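Usage
To use the dataset, researchers can load its different configurations from the Hugging Face platform, for example the testmini configuration or its per-version subsets. For evaluation, the recommended approach is the chain-of-thought evaluation strategy proposed with the dataset: an advanced model such as GPT-4(V) adaptively extracts the reasoning steps from a model's output and scores them step by step, rather than simply judging whether the final answer is correct. This fine-grained evaluation reveals the quality of a model's intermediate reasoning and supports in-depth analysis of multi-modal math problem solving. The dataset is compatible with evaluation frameworks such as lmms-eval, which makes it easy to integrate into existing evaluation pipelines.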
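A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The config and split names are copied from the YAML header of this card; the helper `load_split` is a hypothetical convenience wrapper that validates the names before calling `datasets.load_dataset`.

```python
REPO_ID = "CaraJ/MathVerse-lmmseval"

# Config name -> split names, as declared in the dataset card's YAML header.
CONFIG_SPLITS = {
    "testmini": ["testmini"],
    "testmini_version_split": [
        "text_lite",
        "text_dominant",
        "vision_dominant",
        "vision_intensive",
        "vision_only",
    ],
    "testmini_text_only": ["text_only"],
}


def load_split(config: str, split: str):
    """Load one split of one config after checking both names."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split pair: {config!r}/{split!r}")
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_split("testmini_version_split", "vision_only")` would download the vision-only subset, after which a field such as `question_for_eval` can be read from each record.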
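Background and challenges
Background overview
At the intersection of artificial intelligence and multi-modal learning, visual math problem solving is a key frontier for evaluating how deeply models understand their inputs. The MathVerse dataset was created by the AI4Math team in 2024. Its core research question is whether multi-modal large language models genuinely understand and reason over mathematical diagrams. The dataset collects 2,612 high-quality problems spanning plane geometry, solid geometry, functions, and other mathematical subfields, and expert annotation produces six versions of each problem with different modality content, for about 15,000 test samples in total. This design fills a gap in existing benchmarks for visual math evaluation and provides a basis for fair, in-depth evaluation of multi-modal reasoning models, with implications for intelligent math education and general artificial intelligence.
Current challenges
MathVerse targets a core challenge for multi-modal large language models in visual math problem solving: whether a model truly relies on the diagram to reason, rather than exploiting textual cues alone. Building the dataset involved several difficulties. First, data collection had to ensure that the problems were diverse and of high quality, covering different subjects and difficulty levels. Second, when generating the versions, human annotators had to control the balance between visual and textual information precisely to create six modality variants, from text-only to vision-dominant, which places high demands on annotator expertise and consistency. Finally, the evaluation strategy had to go beyond binary judgment with a fine-grained, chain-of-thought-based scoring mechanism that exposes the specific errors in a model's reasoning, which increases the complexity and computational cost of evaluation.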
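Common scenarios
Typical use cases
In visual math problem solving, the MathVerse dataset provides a standard setting for evaluating multi-modal large language models (MLLMs). By transforming each math problem into six versions that range from text-only to vision-intensive combinations of modalities, the dataset lets researchers systematically test whether a model truly understands the information in a diagram. To solve the geometry and function problems, a model must combine visual and textual information in its reasoning, which makes the dataset a standard platform for evaluating the cross-modal mathematical reasoning of MLLMs.
Academic problems addressed
MathVerse addresses the insufficient evaluation of diagram understanding in MLLMs on visual math problems. Traditional benchmarks often embed too much visual information in the text of a question, so a model may derive the answer without actually interpreting the diagram. With its carefully constructed multi-version problems, the dataset separates textual from visual information, so researchers can quantify how much a model depends on visual content. This exposes the real visual understanding of MLLMs in mathematical reasoning and makes multi-modal reasoning evaluation more rigorous and fine-grained.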
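One simple way to quantify this dependence is to compare accuracy across problem versions. The sketch below is illustrative only: the `prediction` key, the exact-match comparison, and the toy records are assumptions, and the version labels reuse the split names from this card rather than the actual values of the `problem_version` field.

```python
from collections import defaultdict


def accuracy_by_version(records):
    """Return exact-match accuracy for each problem version."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        version = record["problem_version"]
        totals[version] += 1
        # Exact string match is a simplification of real answer checking.
        if record["prediction"].strip() == record["answer"].strip():
            correct[version] += 1
    return {version: correct[version] / totals[version] for version in totals}


def accuracy_gap(accuracy, text_version, vision_version):
    """Accuracy lost when information moves from the text into the diagram."""
    return accuracy[text_version] - accuracy[vision_version]


# Toy predictions, made up for the example.
toy_records = [
    {"problem_version": "text_dominant", "answer": "A", "prediction": "A"},
    {"problem_version": "text_dominant", "answer": "B", "prediction": "B"},
    {"problem_version": "vision_only", "answer": "A", "prediction": "A"},
    {"problem_version": "vision_only", "answer": "B", "prediction": "C"},
]

toy_accuracy = accuracy_by_version(toy_records)
print(toy_accuracy, accuracy_gap(toy_accuracy, "text_dominant", "vision_only"))
```

A large positive gap would suggest that the model leans on textual cues and struggles once the same information is available only in the diagram.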
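Derived work
MathVerse has given rise to follow-up research, particularly on multi-modal evaluation methods. Building on its chain-of-thought evaluation strategy, researchers have developed finer-grained schemes for scoring reasoning steps, using models such as GPT-4(V) for adaptive error analysis. This work has deepened the understanding of the intermediate reasoning quality of MLLMs and has shaped how later visual math benchmarks are built, influencing annotation and evaluation standards for subsequent multi-modal math datasets and moving the field toward more rigorous and interpretable evaluation.
The content above was collected and summarized by 遇见数据集.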



