VideoMathQA

Name: VideoMathQA
Creator: maas
Published: 2026-05-09 22:28:36
License: No description provided

ModelScope community: updated 2026-05-09, indexed 2025-06-07

Download link:

https://modelscope.cn/datasets/MBZUAI/VideoMathQA

Description:

# VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

[![Paper](https://img.shields.io/badge/📄_arXiv-Paper-blue)](https://arxiv.org/abs/2506.05349)
[![Website](https://img.shields.io/badge/🌐_Project-Website-87CEEB)](https://mbzuai-oryx.github.io/VideoMathQA)
[![🏅 Leaderboard (Reasoning)](https://img.shields.io/badge/🏅_Leaderboard-Reasoning-red)](https://hanoonar.github.io/VideoMathQA/#leaderboard-2)
[![🏅 Leaderboard (Direct)](https://img.shields.io/badge/🏅_Leaderboard-Direct-yellow)](https://hanoonar.github.io/VideoMathQA/#leaderboard)
[![📊 Eval (LMMs-Eval)](https://img.shields.io/badge/📊_Eval-LMMs--Eval-orange)](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/videomathqa)

## 📣 Announcement

Note that the official evaluation for **VideoMathQA** is supported in the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/videomathqa) framework. Please use the GitHub repository [`mbzuai-oryx/VideoMathQA`](https://github.com/mbzuai-oryx/VideoMathQA) to create or track any issues related to VideoMathQA that you may encounter.

---

## 💡 VideoMathQA

**VideoMathQA** is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from **three modalities** (visuals, audio, and text) across time. The benchmark tackles the **needle-in-a-multimodal-haystack** problem, where key information is sparse and spread across different modalities and moments in the video.

<img src="images/intro_fig.png" alt="Highlight Figure">

The foundation of our benchmark is the needle-in-a-multimodal-haystack challenge, capturing the core difficulty of cross-modal reasoning across time from visual, textual, and audio streams. Built on this, VideoMathQA categorizes each question along four key dimensions: reasoning type, mathematical concept, video duration, and difficulty.
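The four-way categorization can be pictured as a set of tags attached to every question. The sketch below is illustrative only: the field names (`reasoning_type`, `concept`, `duration`, `difficulty`), the difficulty labels, and the sample rows are assumptions made for this example, not the dataset's actual schema, so check the released files for the real column names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionTags:
    """Hypothetical tags for one question; all field names are assumptions."""
    reasoning_type: str  # e.g. "Problem Focused"
    concept: str         # e.g. "geometry"
    duration: str        # "short", "medium", or "long"
    difficulty: str      # hypothetical labels such as "easy" / "hard"


def count_by(tags, dimension):
    """Count questions per value of one dimension (given as an attribute name)."""
    return Counter(getattr(t, dimension) for t in tags)


# Made-up rows, not taken from the benchmark.
sample = [
    QuestionTags("Problem Focused", "geometry", "short", "easy"),
    QuestionTags("Concept Transfer", "statistics", "medium", "hard"),
    QuestionTags("Problem Focused", "geometry", "long", "hard"),
]

print(count_by(sample, "concept"))
```

Keeping the tags in one immutable record makes it easy to slice a result set along any single dimension without touching the others.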
---

## 🔥 Highlights

- **Multimodal Reasoning Benchmark:** VideoMathQA introduces a challenging **needle-in-a-multimodal-haystack** setup where models must reason across **visuals, text, and audio**. Key information is **sparsely distributed across modalities and time**, requiring strong performance in fine-grained visual understanding, multimodal integration, and reasoning.
- **Three Types of Reasoning:** Questions are categorized into:
  - **Problem Focused**, where the question is explicitly stated and solvable via direct observation and reasoning from the video;
  - **Concept Transfer**, where a demonstrated method or principle is adapted to a newly posed problem;
  - **Deep Instructional Comprehension**, which requires understanding long-form instructional content, interpreting partially worked-out steps, and completing the solution.
- **Diverse Evaluation Dimensions:** Each question is evaluated across four axes, which capture diversity in content, length, complexity, and reasoning depth:
  - **mathematical concepts**, 10 domains such as geometry, statistics, arithmetic, and charts;
  - **video duration**, ranging from 10 seconds to 1 hour and categorized as short, medium, or long;
  - **difficulty level**;
  - **reasoning type**.
- **High-Quality Human Annotations:** The benchmark includes **420 expert-curated questions**, each with five answer choices, a correct answer, and detailed **chain-of-thought (CoT) steps**. Over **2,945 reasoning steps** have been manually written, reflecting **920+ hours** of expert annotation effort with rigorous quality control.

## 🔍 Examples from the Benchmark

We present example questions from VideoMathQA illustrating the three reasoning types: Problem Focused, Concept Transfer, and Deep Instructional Comprehension. The benchmark includes evolving dynamics in a video, complex text prompts, five multiple-choice options, the expert-annotated step-by-step reasoning to solve the given problem, and the final correct answer, as shown in the figure below.
<img src="images/data_fig.png" alt="Figure 1" width="90%">

---

## 📈 Overview of VideoMathQA

We illustrate an overview of the VideoMathQA benchmark through:

- a) The distribution of questions and model performance across ten mathematical concepts, which highlights a significant gap in the ability of current multimodal models to perform mathematical reasoning over videos.
- b) The distribution of video durations, spanning from short clips of 10 seconds to long videos of up to 1 hour.
- c) Our three-stage annotation pipeline performed by expert science graduates, who annotate detailed step-by-step reasoning trails, with strict quality assessment at each stage.

<img src="images/stat_fig.png" alt="Figure 2" width="90%">
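A per-concept performance breakdown like the one described in (a) can be approximated with a small aggregation. This is a minimal sketch, assuming each evaluated item is available as a `(concept, predicted, correct)` triple of strings; that layout and the function name `accuracy_by_concept` are hypothetical, and official scores should come from the `lmms-eval` task rather than from this snippet.

```python
from collections import defaultdict


def accuracy_by_concept(results):
    """Return {concept: fraction of correctly answered questions}.

    `results` is an iterable of (concept, predicted, correct) triples.
    This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for concept, predicted, correct in results:
        totals[concept] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == correct.strip().upper():
            hits[concept] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Made-up predictions over five-option questions (options A to E).
demo = [
    ("geometry", "A", "A"),
    ("geometry", "c", "B"),
    ("statistics", "E", "E"),
]
print(accuracy_by_concept(demo))  # {'geometry': 0.5, 'statistics': 1.0}
```

The same function works for any of the other dimensions (duration, difficulty, reasoning type) if the first element of each triple is swapped for the corresponding tag.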


Provider: maas
Created: 2025-06-02
