JamesGoGo/DeepVision-103K

Name: JamesGoGo/DeepVision-103K
Creator: JamesGoGo
Published: 2026-03-04 11:31:40
License: MIT

Source: Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/JamesGoGo/DeepVision-103K

Description:

---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- image-text-to-text
pretty_name: DeepVision-103K
tags:
- math
- multimodal
- reasoning
- rl
configs:
- config_name: visual_logic
  data_files:
  - split: train
    path: visual_logic-26k.parquet
- config_name: math
  data_files:
  - split: train
    path: math-77k.parquet
---

<div align="center">

# 🔭 DeepVision-103K

<div>
A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
</div>
</div>

<div>
<br>

<div align="center">

[![Data](https://img.shields.io/badge/Data-4d5eff?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/skylenage/DeepVision-103K)
[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/SKYLENAGE-AI/DeepVision-103K)
[![Paper](https://img.shields.io/badge/Paper-2602.16742-b31b1b.svg?style=for-the-badge)](https://huggingface.co/papers/2602.16742)

</div>
</div>

Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:

<div align="center">
<img src="./assets/perf.png" width="100%"/>

<sub>Average Performance on multimodal math and general multimodal benchmarks.</sub>
</div>

Training on DeepVision-103K also elicits more efficient reasoning: the resulting model matches the average accuracy of Qwen3-VL-8B-Thinking (70.15 vs. 70.13) while using fewer tokens on average (3173 vs. 4995).

| Benchmark   | Qwen3-VL-8B-Instruct (Acc / Tokens) | Qwen3-VL-8B-DeepVision (Acc / Tokens) | Qwen3-VL-8B-Thinking (Acc / Tokens) |
| ----------- | ----------------------------------- | ------------------------------------- | ----------------------------------- |
| WeMath      | 79.36 / 1428                        | 85.11 / 2010                          | 84.54 / 3754                        |
| MathVision  | 51.44 / 4288                        | 55.49 / 5738                          | 57.89 / 8970                        |
| MathVerse   | 67.38 / 1572                        | 72.46 / 2714                          | 72.84 / 4665                        |
| LogicVista  | 61.16 / 1769                        | 64.73 / 2716                          | 64.73 / 6115                        |
| MMMU_val    | 67.66 / 2099                        | 71.33 / 2758                          | 69.33 / 5082                        |
| MMMU_Pro    | 67.69 / 2170                        | 70.29 / 2895                          | 70.29 / 5037                        |
| M³CoT       | 70.83 / 1029                        | 71.61 / 1294                          | 71.31 / 2761                        |
| **Average** | 66.50 / 2333                        | **70.15 / 3173**                      | 70.13 / 4995                        |

## 📢 News

- **Feb 16, 2026**: We release **`DeepVision-103K`**, a large-scale, visually diverse, and verifiable multimodal mathematical dataset for advancing multimodal reasoning via RLVR.

## 📦 Resource

- 🧩 Training data: [`DeepVision-103K`](https://huggingface.co/datasets/skylenage/DeepVision-103K)
- 💻 Code: [`DeepVision-103K`](https://github.com/SKYLENAGE-AI/DeepVision-103K)
- 📄 Paper: [DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning](https://huggingface.co/papers/2602.16742)

## 📝 Overview

**`DeepVision-103K`** is a dataset designed for LMM Reasoning, curated from diverse real-world K12 educational sources. Key features include:

**1. Visual Diversity**: DeepVision-103K covers planar geometry, solid geometry, analytic plots, data charts, schematic diagrams, and real-world items in mathematical contexts.

<div align="center">
<img src="./assets/visual_elements.png" width="100%"/>

<sub>Visual elements in DeepVision-103K</sub>
</div>

Within each category, DeepVision offers richer element types than existing open-source datasets.

<div align="center">
<img src="./assets/ve3.png" width="100%"/>

<sub>The number of different visual element types across training datasets.</sub>
</div>

**2. Broad Coverage**: DeepVision-103K spans Geometry, Algebra, Probability & Statistics, and Fundamental Mathematical Skills.

<div align="center">
<img src="./assets/domain.png" width="400"/>

<sub>Hierarchical breakdown of mathematical topics covered in DeepVision-103K.</sub>
</div>

**3. Rich Data Format**: Each sample contains structured annotations to support various downstream tasks:

<div align="center">
<img src="./assets/overview.png" width="600"/>

<sub>A data sample from DeepVision-103K.</sub>
</div>

- **Question & Image**: Problem statement and corresponding image.
- **Final Answer**: A unique, verifiable answer enabling rule-based reward computation in RLVR.
- **Pass Rate**: The proportion of correct responses obtained during model rollouts.
- **Topic**: Hierarchical classification of the mathematical branch.
- **Knowledge Points**: Specific mathematical concepts, theorems, or techniques required.
- **Visual Elements**: Geometric or graphical objects depicted in the image.

## Curation Pipeline

A three-stage pipeline transforms diverse but noisy real-world K12 problems into structured and verifiable QA pairs:

- **Validity Filtering**: Remove problems unsuitable for RL (proof-based, descriptive, multi-answer questions).
- **Difficulty Filtering**: Calibrate sample difficulty via model rollout pass rates.
- **Query Correctness Verification**: Validate image-question pairs and answers using Gemini-3-Flash.

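The relationship between the Final Answer field, the Pass Rate field, and the difficulty-filtering stage can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: `normalize`, `rule_based_reward`, `pass_rate`, and `keep_for_training` are hypothetical names, the normalized exact-match check is a simple stand-in for a real verifier, and the open (0, 1) pass-rate band is an assumed threshold that the dataset card does not specify.

```python
def normalize(answer: str) -> str:
    """Crude normalization: trim, drop a trailing period, remove spaces, lowercase."""
    return answer.strip().rstrip(".").replace(" ", "").lower()


def rule_based_reward(prediction: str, final_answer: str) -> float:
    """Return 1.0 on a normalized exact match, else 0.0 (stand-in for a real verifier)."""
    return 1.0 if normalize(prediction) == normalize(final_answer) else 0.0


def pass_rate(rollouts: list[str], final_answer: str) -> float:
    """Fraction of rollout answers that match the reference answer."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(rule_based_reward(r, final_answer) for r in rollouts) / len(rollouts)


def keep_for_training(rate: float, low: float = 0.0, high: float = 1.0) -> bool:
    """Keep samples that are neither always failed nor always solved.

    The (low, high) band is a hypothetical choice, not taken from the source.
    """
    return low < rate < high


# Eight made-up rollout answers for a problem whose reference answer is "12".
rollouts = ["12", " 12. ", "13", "12", "7", "12", "12", "11"]
rate = pass_rate(rollouts, "12")
print(rate, keep_for_training(rate))
```

A sample that every rollout solves, or that none solves, gives a group of identical rewards and therefore little learning signal under group-relative advantage estimation, which is one common motivation for filtering on pass rate.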
<div align="center">
<img src="./assets/pipeline.png" width="600"/>

<sub>Curation pipeline for mathematical data in DeepVision-103K.</sub>
</div>

## 📊 Main Results

Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:

<div align="center">
<img src="./assets/perf.png" width="100%"/>

<sub>Average Performance on multimodal math and general multimodal benchmarks.</sub>
</div>

<div align="center">
<img src="./assets/bench_results.png" width="600"/>

<sub>Specific Performance on multimodal math and general multimodal benchmarks.</sub>
</div>

## DeepVision-103K Training & Evaluation Toolkit

We use [GSPO](https://arxiv.org/abs/2507.18071) for training and [vllm](https://github.com/vllm-project/vllm) for async batch evaluation. The training code is built on top of [verl](https://github.com/volcengine/verl). We use [swanlab](https://github.com/SwanHubX/SwanLab) for experiment tracking.

### Installation

#### Recommended Environment

We recommend the following environment configuration:

- CUDA 12.8
- PyTorch 2.8.0
- vLLM 0.11.0
- Transformers 4.57.1

#### Setup Steps

```bash
# Clone the repo
git clone https://github.com/SKYLENAGE-AI/DeepVision-103K && cd DeepVision-103K

# Install mathverify for rule-based verification
pip install mathverify

# Install qwen_vl_utils for model training
pip install qwen_vl_utils

# Install verl in editable mode
pip install -e .
```

---

### Training

Two training templates are provided under `train_scripts/`. Both use the GSPO algorithm with GRPO advantage estimation.

#### Quick Start

1. **Search for `{YOUR_`** in the script to find all placeholders that need to be filled in:

   | Placeholder | Description |
   |---|---|
   | `{YOUR_SWANLAB_API_KEY}` | Your SwanLab API key (for experiment tracking) |
   | `{YOUR_PROJECT_NAME}` | Project name for experiment grouping |
   | `{YOUR_BASE_MODEL}` | Base model identifier (used in experiment naming) |
   | `{YOUR_ROOT_PATH}` | Root directory for saving checkpoints |
   | `{YOUR_MODEL_PATH}` | Path to the pretrained model (e.g. HuggingFace format) |
   | `{YOUR_TRAIN_FILE}` | Path to training data (`.parquet` format) |
   | `{YOUR_TEST_FILE}` | Path to validation data (`.parquet` format) |

2. **Uncomment the GPU setting block** that matches your cluster size (8 / 16 / 32 / 64 GPUs).

3. **Run the script.**

#### Single-Node Training (8/16 GPUs on one machine)

```bash
bash train_scripts/train_single_node_template.sh
```

#### Multi-Node Training (Ray cluster across multiple machines)

```bash
# Submit to each node via your job scheduler
# Environment variables RANK, WORLD_SIZE, MASTER_ADDR must be set by the scheduler
bash train_scripts/train_multi_node_template.sh
```

### Evaluation

The evaluation pipeline under `eval_scripts/` provides inference and evaluation scripts.

#### Inference

1. **Fill in placeholders** in `caller.sh`:

   ```bash
   python caller_async.py \
     --model /path/to/your/model \
     --input /path/to/input.jsonl \
     --output /path/to/output.jsonl \
     --hyperparam mimo \
     --prompt-field prompt \
     --gpu-devices "0,1,2,3,4,5,6,7" \
     --tensor-parallel-size 1 \
     --data-parallel-size 8 \
     --concurrent-per-endpoint 16 \
     --max-tokens 16384 \
     --n 8
   ```

2. **Run:**

   ```bash
   cd eval_scripts
   bash caller.sh
   ```

### Post-Inference Evaluation

After inference is complete, use the evaluation tools under `eval_scripts/evaluation/` to score and analyze results.

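As a rough picture of what scoring produces, the sketch below aggregates judged responses into an accuracy and an average token count, the same two quantities reported in the "Acc / Tokens" table. It is a minimal sketch and not the repository's evaluation code: the record fields `correct` and `num_tokens` and the function `summarize` are hypothetical, and the actual output schema of `caller_async.py` is not documented here.

```python
import json
from statistics import mean


def summarize(jsonl_lines):
    """Average correctness (as a percentage) and token count over all responses.

    Each line is assumed to be one judged response with the hypothetical
    fields `correct` (bool) and `num_tokens` (int).
    """
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        raise ValueError("no records to summarize")
    accuracy = 100.0 * mean(1.0 if r["correct"] else 0.0 for r in records)
    tokens = mean(r["num_tokens"] for r in records)
    return round(accuracy, 2), round(tokens)


# Four made-up judged responses (e.g. several samples for the same prompt).
sample = [
    '{"correct": true, "num_tokens": 1200}',
    '{"correct": false, "num_tokens": 2400}',
    '{"correct": true, "num_tokens": 1000}',
    '{"correct": true, "num_tokens": 1400}',
]
print(summarize(sample))
```

With `--n 8`, each prompt yields eight responses; averaging over all of them, as done here, is one possible convention, and the repository's scripts may aggregate differently.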
#### Step 1: Math-Verify Rule-Based Evaluation

Run the math-verify judge to compute accuracy and automatically export error cases:

```bash
python eval_scripts/evaluation/mathverify_judge.py -i /path/to/your_output.jsonl
```

#### Step 2: GPT-5-mini Re-Judge on Error Cases

For the exported error cases (`*_mathverify_error.jsonl`), use GPT-5-mini as a secondary judge to catch false negatives from rule-based matching.

The judge prompt template is defined in `eval_scripts/evaluation/gpt5-mini-judge_prompt.md`.

## 📖 Citation

```bibtex
@misc{sun2026deepvision103kvisuallydiversebroadcoverage,
  title={DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning},
  author={Haoxiang Sun and Lizhen Xu and Bing Zhao and Wotao Yin and Wei Wang and Boyu Yang and Rui Wang and Hu Wei},
  year={2026},
  eprint={2602.16742},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.16742},
}
```

## 🙏 Acknowledgements

This work builds upon the following resources:

- **[MM-MathInstruct-3M](https://huggingface.co/datasets/MathLLMs/MM-MathInstruct)**: Large-scale multimodal math instruction data from real educational contexts.
- **[MultiMath-300K](https://huggingface.co/datasets/pengshuai-rin/multimath-300k)**: Multimodal mathematical dataset from real educational contexts.
- **[Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: Visual logic reasoning problems.
- **[GameQA](https://huggingface.co/datasets/OpenMOSS-Team/GameQA-140K)**: Game-based visual reasoning tasks.


Provider: JamesGoGo
