GSM8K-V

Name: GSM8K-V
Creator: maas
Published: 2026-05-21 21:02:11
License: No description available

ModelScope community: updated 2026-05-21, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/evalscope/GSM8K-V

Resource description:

<div align="center">
  <img src="assets/logo.png" alt="GSM8K-V Logo" width="120px" style="vertical-align: baseline;" />
  <h1>GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts?</h1>
</div>

<div align="center">

[Fan Yuan](mailto:yuanfan7777777@gmail.com)<sup>1,*</sup>, [Yuchen Yan](mailto:yanyuchen@zju.edu.cn)<sup>1,*</sup>, Yifan Jiang<sup>1</sup>, Haoran Zhao<sup>1</sup>, Tao Feng<sup>1</sup>, Jinyan Chen<sup>1</sup>, Yanwei Lou<sup>1</sup>, Wenqi Zhang<sup>1</sup>, [Yongliang Shen](mailto:syl@zju.edu.cn)<sup>1,†</sup>, Weiming Lu<sup>1</sup>, Jun Xiao<sup>1</sup>, Yueting Zhuang<sup>1</sup>

</div>

<sup>1</sup>Zhejiang University

<sup>*</sup>Equal contribution, <sup>†</sup>Corresponding author

💻 <a href="https://github.com/ZJU-REAL/GSM8K-V">GitHub</a> | 🤗 <a href="https://huggingface.co/datasets/ZJU-REAL/GSM8K-V">Dataset</a> | 🤗 <a href="https://huggingface.co/papers/2509.25160">Hf-Paper</a> | 📝 <a href="https://arxiv.org/abs/2509.25160">Arxiv</a> | 🌐 <a href="https://zju-real.github.io/GSM8K-V">ProjectPage</a>

<img src="assets/intro.png" alt="GSM8K-V Pipeline" style="width: 100%; height: auto; display: block; margin: 0 auto;">

## 🔔 News

- 🔥 **2025.09.30:** Paper is released! 🚀
- 🔥 **2025.09.28:** Code for evaluation is available! 🚀
- 🔥 **2025.09.28:** Home page is available. 🌟

## 👁️ Overview

<img src="assets/main_01.png" alt="GSM8K-V Pipeline">

**GSM8K-V** is a purely visual, multi-image mathematical reasoning benchmark that systematically maps each GSM8K math word problem into its visual counterpart to enable a clean, within-item comparison across modalities. Built via an automated pipeline that extracts and allocates problem information across scenes, generates scene-level descriptions, and renders images, coupled with meticulous human annotation, the benchmark comprises 1,319 high-quality multi-scene problems (5,343 images). It addresses limitations of prior visual math evaluations, which predominantly focus on geometry, seldom cover visualized word problems, and rarely test reasoning across multiple images with semantic dependencies.
Evaluations of a broad range of open- and closed-source models reveal a substantial modality gap: for example, Gemini-2.5-Pro attains 95.22% accuracy on text-based GSM8K but only 46.93% on GSM8K-V. This gap highlights persistent challenges in understanding and reasoning over images in realistic scenarios and provides a foundation to guide the development of more robust and generalizable vision-language models.

Our main contributions are summarized as follows.

- We propose an automated framework that converts text-based math word problems into visual form. Specifically, we construct detailed multi-scene textual descriptions and leverage image generation models to produce corresponding visual representations.
- Building on the proposed data construction framework and careful human annotation, we introduce a vision-based mathematical reasoning benchmark, **GSM8K-V**, which enables the evaluation of VLMs on more realistic mathematical problem-solving scenarios.
- We perform a thorough evaluation and analysis of existing VLMs on **GSM8K-V**. The results reveal substantial room for improvement, and our analysis provides valuable insights for enhancing the mathematical reasoning capabilities of future VLMs.

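
The headline figures quoted in the overview can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the GSM8K-V codebase: the constant and function names are hypothetical and chosen for illustration, and the only inputs are the numbers stated above (1,319 problems, 5,343 images, 95.22% and 46.93% accuracy for Gemini-2.5-Pro).

```python
# Figures quoted in the overview; every name below is illustrative only
# and does not come from the GSM8K-V repository.
NUM_PROBLEMS = 1319
NUM_IMAGES = 5343
TEXT_ACCURACY = 95.22    # Gemini-2.5-Pro on text-based GSM8K (%)
VISUAL_ACCURACY = 46.93  # Gemini-2.5-Pro on GSM8K-V (%)


def modality_gap(text_acc: float, visual_acc: float) -> float:
    """Absolute accuracy drop, in percentage points, from text to visual input."""
    return round(text_acc - visual_acc, 2)


def relative_drop(text_acc: float, visual_acc: float) -> float:
    """Accuracy drop expressed as a fraction of the text-based accuracy."""
    return (text_acc - visual_acc) / text_acc


def images_per_problem(num_images: int, num_problems: int) -> float:
    """Average number of scene images per problem."""
    return num_images / num_problems


if __name__ == "__main__":
    gap = modality_gap(TEXT_ACCURACY, VISUAL_ACCURACY)
    rel = relative_drop(TEXT_ACCURACY, VISUAL_ACCURACY)
    avg = images_per_problem(NUM_IMAGES, NUM_PROBLEMS)
    print(f"Modality gap: {gap:.2f} percentage points")
    print(f"Relative drop: {rel:.1%} of text-based accuracy")
    print(f"Average images per problem: {avg:.2f}")
```

Run as a script, this reports a gap of 48.29 percentage points (roughly half of the text-based accuracy) and an average of about 4.05 images per problem, which is consistent with the benchmark's multi-scene design.
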
## 🚀 Sample Usage

```bash
# Clone the repository
git clone https://github.com/ZJU-REAL/GSM8K-V.git
cd GSM8K-V

# Create conda environment (optional)
conda create -n gsm8k-v python=3.10
conda activate gsm8k-v

# Install dependencies
pip install -r requirements.txt

# Command for vllm mode
python eval.py --type vllm \
    --model_name <eval_model_name> --api_base <vllm_api_base> \
    --concurrency <eval_parallel_num> --image_dir <data_path>

# Command for api mode
python eval.py --type api \
    --model_name <eval_model_name> --api_key <your_api_key> \
    --concurrency <eval_parallel_num> --image_dir <data_path>
```

## 📊 Benchmark Statistics

<img src="assets/data_statistic.png" alt="Dataset Statistics" width="45%"> <img src="assets/data_distribution_01.png" alt="Category Distribution" width="45%">

## 📈 Main Results

<img src="assets/main_result.png" alt="Main Result" style="width: 100%; height: auto;">

## ⚙️ Advanced Configuration Options

```bash
# Limit number of samples
python eval.py --num-samples 5

# Specify evaluation modes
python eval.py --modes text_only visual scene

# Specify prompt modes for visual evaluation
python eval.py --prompt-modes implicit explicit

# Evaluate only specific categories
python eval.py --data-categories measurement physical_metric

# Evaluate specific subcategories
python eval.py --data-subcategories distance speed weight

# Example usage
# ---- vllm start ----
vllm serve model/internvl3_5-8b \
    --port 8010 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --trust-remote-code \
    --served-model-name "internvl3.5-8b"

# ---- eval start ----
python eval.py --type vllm \
    --model_name internvl3.5-8b --api_base http://localhost:8010/v1 \
    --concurrency 32 --image_dir data/images

# For detailed help
python eval.py --help
```

## 📝 Citation

If you find our work helpful, please consider citing it.
```
@misc{yuan2025gsm8kvvisionlanguagemodels,
      title={GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts},
      author={Fan Yuan and Yuchen Yan and Yifan Jiang and Haoran Zhao and Tao Feng and Jinyan Chen and Yanwei Lou and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2509.25160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25160},
}
```

## ✉️ Contact Us

If you have any questions, please contact us by email: yuanfan7777777@gmail.com


Provider: maas

Created: 2025-11-19
