WE-MATH, MathVista, MathVerse, MATH-Vision, MathScape, CMM-Math, OlympiadBench, MV-MATH, ChartBench, MultiChartQA

GitHub, updated 2025-03-10, indexed 2025-03-01

Download link:

https://github.com/Wild-Cooperation-Hub/Awesome-MLLM-Reasoning-Benchmarks

Overview:

The datasets mentioned in the README include, but are not limited to: WE-MATH, for evaluating the mathematical reasoning of large multimodal models; MathVista; MathVerse; MATH-Vision; MathScape; CMM-Math, a Chinese multimodal mathematics dataset; OlympiadBench, intended to promote the development of AGI; MV-MATH, for evaluating mathematical reasoning in multi-visual contexts; ChartBench, for reasoning over complex data charts; and MultiChartQA.

Created:

2025-02-27

Summary of original information

Awesome-MLLM-Reasoning-Benchmarks

Mathematical Reasoning

WE-MATH
- Paper: https://arxiv.org/pdf/2407.01284
- Dataset: https://huggingface.co/datasets/We-Math/We-Math
MathVista
- Paper: https://arxiv.org/pdf/2310.02255
- Project: https://mathvista.github.io
- Code: https://github.com/lupantech/MathVista
- Dataset: https://huggingface.co/datasets/AI4Math/MathVista
- Leaderboard: https://mathvista.github.io/#leaderboard
MathVerse
- Paper: https://arxiv.org/pdf/2403.14624
- Project: https://mathverse-cuhk.github.io
- Code: https://github.com/ZrrSkywalker/MathVerse
- Dataset: https://huggingface.co/datasets/AI4Math/MathVerse
- Leaderboard: https://mathverse-cuhk.github.io/#leaderboard
MATH-Vision
- Paper: https://arxiv.org/pdf/2402.14804
- Project: https://mathllm.github.io/mathvision
- Code: https://github.com/mathllm/MATH-V
- Dataset: https://huggingface.co/datasets/MathLLMs/MathVision
- Leaderboard: https://mathllm.github.io/mathvision/#leaderboard
MathScape
- Paper: https://arxiv.org/pdf/2408.07543
- Code: https://github.com/PKU-Baichuan-MLSystemLab/MathScape?tab=readme-ov-file
- Dataset: https://drive.google.com/file/d/1Y3cnKPyryM0_m5QJQIOkF09KjDO9Q_QH/view
CMM-Math
- Paper: https://arxiv.org/pdf/2409.02834
- Code: https://github.com/ECNU-ICALK/EduChat-Math/
OlympiadBench
- Paper: https://arxiv.org/pdf/2402.14008
- Dataset: https://huggingface.co/datasets/Hothan/OlympiadBench

Chart Reasoning

ChartBench
- Paper: https://arxiv.org/pdf/2312.15915
- Project: https://chartbench.github.io/
- Dataset: https://huggingface.co/datasets/SincereX/ChartBench
MultiChartQA
- Paper: https://arxiv.org/pdf/2410.14179v2
- Code: https://github.com/Zivenzhu/Multi-chart-QA

Scientific Reasoning

M4U
- Paper: https://arxiv.org/pdf/2405.15638
- Project: https://m4u-benchmark.github.io/m4u.github.io/
- Dataset: https://huggingface.co/datasets/M4U-Benchmark/M4U
MMMU
- Paper: https://arxiv.org/pdf/2311.16502
- Dataset: https://huggingface.co/datasets/MMMU/MMMU
MMMU-Pro
- Paper: https://arxiv.org/pdf/2409.02813
- Dataset: https://huggingface.co/datasets/MMMU/MMMU_Pro
ScienceQA
- Paper: https://arxiv.org/pdf/2209.09513
- Dataset: https://huggingface.co/datasets/derek-thomas/ScienceQA
TheoremQA
- Paper: https://arxiv.org/pdf/2305.12524
- Dataset: https://huggingface.co/datasets/TIGER-Lab/TheoremQA
Can MLLMs Reason in Multimodality? EMMA
- Paper: https://arxiv.org/pdf/2501.05444v1
- Dataset: https://huggingface.co/datasets/luckychao/EMMA
GAOKAO-MM
- Paper: https://arxiv.org/pdf/2402.15745
- Code: https://github.com/OpenMOSS/GAOKAO-MM
OlympiadBench
- Paper: https://arxiv.org/pdf/2402.14008
- Dataset: https://huggingface.co/datasets/Hothan/OlympiadBench
CMMMU
- Paper: https://arxiv.org/pdf/2401.11944
- Dataset: https://huggingface.co/datasets/m-a-p/CMMMU

Code Generation

ChartMimic
- Paper: https://arxiv.org/pdf/2406.09961
- Code: https://github.com/ChartMimic/ChartMimic
- Dataset: https://huggingface.co/datasets/ChartMimic/ChartMimic
Plot2Code
- Paper: https://arxiv.org/pdf/2405.07990
- Dataset: https://huggingface.co/datasets/TencentARC/Plot2Code
HumanEval-V
- Paper: https://arxiv.org/pdf/2410.12381
- Code: https://github.com/HumanEval-V/HumanEval-V-Benchmark
- Dataset: https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark
- Leaderboard: https://humaneval-v.github.io/#leaderboard

Multi-Image Based Inductive Reasoning

MM-IQ
- Paper: https://arxiv.org/pdf/2502.00698
- Dataset: https://huggingface.co/datasets/huanqia/MM-IQ
LogicVista
- Paper: https://arxiv.org/pdf/2407.04973
- Code: https://github.com/Yijia-Xiao/LogicVista
The Jumping Reasoning Curve?
- Paper: https://arxiv.org/pdf/2502.01081
- Code: https://github.com/declare-lab/LLM-PuzzleTest/

Social and Cultural Knowledge Reasoning

Computational Meme Understanding: A Survey
- Paper: https://aclanthology.org/2024.emnlp-main.1184.pdf
II-Bench
- Paper: https://arxiv.org/pdf/2406.05862
- Dataset: https://huggingface.co/datasets/m-a-p/II-Bench
Can MLLMs Understand the Deep Implication Behind Chinese Images?
- Paper: https://arxiv.org/pdf/2410.13854
- Project: https://cii-bench.github.io/
- Dataset: https://huggingface.co/datasets/m-a-p/CII-Bench
- Leaderboard: https://cii-bench.github.io/#leaderboard
PunchBench
- Paper: https://arxiv.org/pdf/2412.11906
GPT-4V(ision) as A Social Media Analysis Engine
- Paper: https://arxiv.org/pdf/2311.07547
- Code: https://github.com/VIStA-H/GPT-4V_Social_Media
Geolocation with Real Human Gameplay Data:
- Paper: https://arxiv.org/pdf/2502.13759

Algorithmic Problem

NPHardEval4V
- Paper: https://arxiv.org/pdf/2403.01777
- Code: https://github.com/lizhouf/NPHardEval4V

Action Prediction

Autonomous Driving
- Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction
  - Paper: https://arxiv.org/pdf/2310.04671v4
  - Code: https://github.com/DHPR-dataset/DHPR-dataset
  - Dataset: https://huggingface.co/datasets/DHPR/Driving-Hazard-Prediction-and-Reasoning
Robot Manipulation
- A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
  - Paper: https://arxiv.org/pdf/2502.08643
  - Code: https://github.com/shivanshpatel35/IKER
  - Project: https://iker-robot.github.io/
  - Project: https://simpler-env.github.io/
GUI Agent
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
  - Paper: https://arxiv.org/pdf/2501.04575
  - Code: https://github.com/Reallm-Labs/InfiGUIAgent
  - Dataset: https://huggingface.co/datasets/Reallm-Labs/InfiGUIAgent-Data
- Mind2Web: Towards a Generalist Agent for the Web
  - Paper: https://arxiv.org/abs/2306.06070
  - Project: https://osu-nlp-group.github.io/Mind2Web/
  - Code: https://github.com/OSU-NLP-Group/Mind2Web
  - Dataset: https://huggingface.co/datasets/osunlp/Mind2Web
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  - Paper: https://arxiv.org/pdf/2401.10935

Spatial Reasoning

Spatial Planning
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
  - Paper: https://arxiv.org/pdf/2501.07542
- iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs
  - Paper: https://arxiv.org/pdf/2502.03214v1
  - Code: https://github.com/SharkyBamboozle/iVISPAR
Spatial Relationship
- PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models
  - Paper: https://www.arxiv.org/pdf/2502.08636
- Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics
  - Paper: https://arxiv.org/pdf/2502.11859

Other Comprehensive MM Reasoning Benchmarks

M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
- Paper: https://arxiv.org/pdf/2405.16473
- Dataset: https://huggingface.co/datasets/LightChen2333/M3CoT
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- Paper: https://arxiv.org/pdf/2303.11381
- Code: https://github.com/microsoft/MM-REACT
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
- Paper: [https://arxiv.org/pdf/2502

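Collected summary

Dataset introduction

Construction

This collection brings together a range of mathematical problem settings into a comprehensive framework for evaluating the mathematical reasoning of large multimodal models. It spans problems from simple arithmetic to complex chart understanding and scientific reasoning. It draws on multiple data sources and construction techniques, including the fusion of image, text, and chart modalities, with the aim of thoroughly testing models' mathematical reasoning in multimodal settings.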

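Usage

To use these datasets, researchers can take pretrained models available on platforms such as Hugging Face and evaluate them on each dataset's test set. A typical workflow consists of data loading, preprocessing, model inference, and performance evaluation. The datasets also provide public leaderboards, so researchers can compare and improve their models in an open setting.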
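The four workflow steps (data loading, preprocessing, model inference, performance evaluation) can be sketched as below. This is a minimal illustration under stated assumptions, not the evaluation protocol of any listed benchmark: the `question`/`answer` field names, the in-memory `test_split`, the `normalize` rule, and `dummy_model` are all hypothetical stand-ins, and exact-match accuracy is only one possible metric.

```python
from typing import Callable, Dict, List


def normalize(answer: str) -> str:
    """Preprocessing: trim whitespace, drop a trailing period, lowercase."""
    return answer.strip().rstrip(".").lower()


def evaluate(samples: List[Dict[str, str]],
             predict: Callable[[Dict[str, str]], str]) -> float:
    """Run inference on every sample and return exact-match accuracy."""
    if not samples:
        return 0.0
    correct = sum(
        normalize(predict(sample)) == normalize(sample["answer"])
        for sample in samples
    )
    return correct / len(samples)


# Data loading: a hypothetical in-memory test split. Real benchmarks also
# carry an image per sample and use their own field names.
test_split = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "Which bar is tallest, A or B?", "answer": "B"},
    {"question": "What is 10 / 4?", "answer": "2.5"},
]


def dummy_model(sample: Dict[str, str]) -> str:
    """Stand-in for a multimodal model call; always answers ' b. '."""
    return " b. "


# Model inference plus performance evaluation.
accuracy = evaluate(test_split, dummy_model)
print(f"accuracy = {accuracy:.3f}")
```

In practice `dummy_model` would be replaced by a call to the model under test, and `test_split` by the records of the chosen benchmark; each benchmark's paper and code define the authoritative answer-matching rules.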

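Background and challenges

Background overview

WE-MATH and the related datasets grew out of research on applying large multimodal language models to mathematical reasoning. They were created in recent years, and the main researchers include Yaya Shi and Zongyang Ma, among others. The core research question is how to evaluate and improve the reasoning ability of large multimodal language models in mathematics and related fields. The release of these datasets has had a significant impact on the field, advancing research on mathematical reasoning, visual reasoning, and other multimodal reasoning tasks.

Current challenges

The main challenges encountered in building these datasets include: 1) how to design questions and tasks that evaluate mathematical reasoning comprehensively and accurately; 2) how to process and integrate different types of data, such as text, images, and charts; 3) how to keep the datasets diverse and challenging enough to suit models at different capability levels. As for the domain problems they address, the challenges include getting models to understand complex mathematical problems, reason effectively, and produce sound solutions.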

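Common scenarios

Typical use cases

As mathematical reasoning benchmarks, WE-MATH and the related datasets are widely used to evaluate how well large multimodal models solve mathematical problems. By providing images that contain mathematical problems and solutions, along with tasks that require cross-modal understanding and reasoning, they have become a standard setting for testing a model's mathematical reasoning.

Academic problems addressed

These datasets address the research problem of how to effectively evaluate multimodal models on mathematical reasoning tasks. They provide a standardized test platform on which researchers can compare different models' mathematical problem-solving performance under a unified evaluation standard, which has advanced the application of multimodal models in mathematics education.

Practical applications

In practice, these datasets can be used to develop and optimize educational technology, such as intelligent tutoring systems that analyze a student's problem-solving process to give personalized learning advice. They can also be applied to algorithm development in areas such as autonomous driving, helping machines better understand spatial relationships and logical reasoning.

Recent research on the datasets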