JamesGoGo/DeepVision-103K
Source: Hugging Face · Updated: 2026-03-04 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/JamesGoGo/DeepVision-103K
Description:
---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- image-text-to-text
pretty_name: DeepVision-103K
tags:
- math
- multimodal
- reasoning
- rl
configs:
- config_name: visual_logic
data_files:
- split: train
path: visual_logic-26k.parquet
- config_name: math
data_files:
- split: train
path: math-77k.parquet
---
<div align="center">
# 🔭 DeepVision-103K
<div>
A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
</div>
</div>
<div>
<br>
<div align="center">
[Hugging Face Dataset](https://huggingface.co/datasets/skylenage/DeepVision-103K)
[GitHub Code](https://github.com/SKYLENAGE-AI/DeepVision-103K)
[Paper](https://huggingface.co/papers/2602.16742)
</div>
</div>
Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:
<div align="center"> <img src="./assets/perf.png" width="100%"/>
<sub>Average Performance on multimodal math and general multimodal benchmarks.</sub> </div>
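Training on DeepVision-103K also elicits more efficient reasoning: per the Average row below, the DeepVision-trained model matches Qwen3-VL-8B-Thinking in accuracy (70.15 vs. 70.13) while generating roughly a third fewer tokens (3173 vs. 4995).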
| Benchmark | Qwen3-VL-8B-Instruct (Acc / Tokens) | Qwen3-VL-8B-DeepVision (Acc / Tokens) | Qwen3-VL-8B-Thinking (Acc / Tokens) |
| ----------- | ----------------------------------- | ------------------------------------- | ----------------------------------- |
| WeMath | 79.36 / 1428 | 85.11 / 2010 | 84.54 / 3754 |
| MathVision | 51.44 / 4288 | 55.49 / 5738 | 57.89 / 8970 |
| MathVerse | 67.38 / 1572 | 72.46 / 2714 | 72.84 / 4665 |
| LogicVista | 61.16 / 1769 | 64.73 / 2716 | 64.73 / 6115 |
| MMMU_val | 67.66 / 2099 | 71.33 / 2758 | 69.33 / 5082 |
| MMMU_Pro | 67.69 / 2170 | 70.29 / 2895 | 70.29 / 5037 |
| M³CoT | 70.83 / 1029 | 71.61 / 1294 | 71.31 / 2761 |
| **Average** | 66.50 / 2333 | **70.15 / 3173** | 70.13 / 4995 |
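The per-benchmark token savings implied by the table can be checked with a few lines of arithmetic. The sketch below hard-codes the DeepVision and Thinking columns copied from the table and computes the fraction of tokens saved on each benchmark; the variable and function names are illustrative only.

```python
# (accuracy, response tokens) copied from the table above.
DEEPVISION = {
    "WeMath": (85.11, 2010),
    "MathVision": (55.49, 5738),
    "MathVerse": (72.46, 2714),
    "LogicVista": (64.73, 2716),
    "MMMU_val": (71.33, 2758),
    "MMMU_Pro": (70.29, 2895),
    "M3CoT": (71.61, 1294),
}
THINKING = {
    "WeMath": (84.54, 3754),
    "MathVision": (57.89, 8970),
    "MathVerse": (72.84, 4665),
    "LogicVista": (64.73, 6115),
    "MMMU_val": (69.33, 5082),
    "MMMU_Pro": (70.29, 5037),
    "M3CoT": (71.31, 2761),
}


def token_reduction(ours, baseline):
    """Fraction of tokens saved relative to the baseline, per benchmark."""
    return {name: 1 - ours[name][1] / baseline[name][1] for name in ours}


def mean_accuracy(rows):
    """Unweighted mean accuracy over all benchmarks."""
    return sum(acc for acc, _ in rows.values()) / len(rows)


if __name__ == "__main__":
    for name, saved in token_reduction(DEEPVISION, THINKING).items():
        print(f"{name}: {saved:.1%} fewer tokens than Thinking")
    print(f"mean accuracy (DeepVision): {mean_accuracy(DEEPVISION):.2f}")
```

On every benchmark the DeepVision-trained model uses fewer tokens than the Thinking variant, and the unweighted mean of its accuracies reproduces the 70.15 in the Average row.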
## 📢 News
- **Feb 16, 2026**: We release **`DeepVision-103K`**, a large-scale, visually diverse, and verifiable multimodal mathematical dataset for advancing multimodal reasoning via RLVR.
## 📦 Resource
- 🧩 Training data: [`DeepVision-103K`](https://huggingface.co/datasets/skylenage/DeepVision-103K)
- 💻 Code: [`DeepVision-103K`](https://github.com/SKYLENAGE-AI/DeepVision-103K)
- 📄 Paper: [DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning](https://huggingface.co/papers/2602.16742)
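A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository listed above is reachable. The config names and parquet file names come from the YAML header of this card; the helper name `load_deepvision` is illustrative.

```python
# Config name -> parquet file, as declared in the dataset card's YAML header.
CONFIG_FILES = {
    "visual_logic": "visual_logic-26k.parquet",
    "math": "math-77k.parquet",
}
REPO_ID = "skylenage/DeepVision-103K"


def load_deepvision(config: str = "math"):
    """Load the train split of one config (downloads data on first use)."""
    if config not in CONFIG_FILES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIG_FILES)}"
        )
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train")


if __name__ == "__main__":
    ds = load_deepvision("math")
    print(ds)            # column names and row count
    print(ds[0].keys())  # inspect the actual field names before relying on them
```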
## 📝 Overview
**`DeepVision-103K`** is a dataset for large multimodal model (LMM) reasoning, curated from diverse real-world K12 educational sources. Key features include:
**1. Visual Diversity**: DeepVision-103K covers planar geometry, solid geometry, analytic plots, data charts, schematic diagrams, and real-world items in mathematical contexts.
<div align="center"> <img src="./assets/visual_elements.png" width="100%"/>
<sub>Visual elements in DeepVision-103K</sub> </div>
Within each category, DeepVision offers richer element types than existing open-source datasets.
<div align="center"> <img src="./assets/ve3.png" width="100%"/>
<sub>The number of different visual element types across training datasets.</sub> </div>
**2. Broad Coverage**: DeepVision-103K spans Geometry, Algebra, Probability & Statistics, and Fundamental Mathematical Skills.
<div align="center"> <img src="./assets/domain.png" width="400"/>
<sub>Hierarchical breakdown of mathematical topics covered in DeepVision-103K.</sub> </div>
**3. Rich Data Format**: Each sample contains structured annotations to support various downstream tasks:
<div align="center"> <img src="./assets/overview.png" width="600"/>
<sub>A data sample from DeepVision-103K.</sub> </div>
- **Question & Image**: Problem statement and corresponding image.
- **Final Answer**: A unique, verifiable answer enabling rule-based reward computation in RLVR.
- **Pass Rate**: The proportion of correct responses obtained during model rollouts.
- **Topic**: Hierarchical classification of the mathematical branch.
- **Knowledge Points**: Specific mathematical concepts, theorems, or techniques required.
- **Visual Elements**: Geometric or graphical objects depicted in the image.
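To make the role of the final answer concrete, here is a minimal sketch of a sample record and a rule-based reward. The field names are hypothetical (the card lists the annotations but not the column names), and the string normalization is a deliberately simple stand-in for a real verifier such as Math-Verify.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    # Hypothetical field names mirroring the annotations listed above.
    question: str
    image_path: str
    final_answer: str
    pass_rate: float
    topic: str = ""
    knowledge_points: list[str] = field(default_factory=list)
    visual_elements: list[str] = field(default_factory=list)


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop spaces, '$' signs, and trailing periods."""
    cleaned = answer.strip().lower().replace(" ", "").replace("$", "")
    return cleaned.rstrip(".")


def rule_based_reward(prediction: str, sample: Sample) -> float:
    """Binary reward: 1.0 if the prediction matches the reference answer."""
    matches = normalize(prediction) == normalize(sample.final_answer)
    return 1.0 if matches else 0.0
```

Because each problem has a single reference answer, the reward needs no learned judge; a stricter math-aware comparison can replace `normalize` without changing the interface.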
## Curation Pipeline
A three-stage pipeline transforms diverse but noisy real-world K12 problems into structured and verifiable QA pairs:
- **Validity Filtering**: Remove problems unsuitable for RL (proof-based, descriptive, multi-answer questions).
- **Difficulty Filtering**: Calibrate sample difficulty via model rollout pass rates.
- **Query Correctness Verification**: Validate image-question pairs and answers using Gemini-3-Flash.
<div align="center"> <img src="./assets/pipeline.png" width="600"/>
<sub>Curation pipeline for mathematical data in DeepVision-103K.</sub> </div>
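The difficulty-filtering stage can be sketched as follows: estimate a pass rate from a handful of rollouts and keep only problems that are neither always solved nor never solved. The thresholds and function names are assumptions for illustration; the card does not state the actual cut-offs.

```python
def pass_rate(rollout_correct: list[bool]) -> float:
    """Proportion of correct responses among the rollouts for one problem."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def keep_by_difficulty(rate: float, low: float = 0.0, high: float = 1.0) -> bool:
    """Keep problems with low < rate < high (hypothetical thresholds)."""
    return low < rate < high


if __name__ == "__main__":
    rollouts = {
        "too_easy": [True] * 8,
        "too_hard": [False] * 8,
        "useful": [True, False, True, False, False, True, False, False],
    }
    kept = [
        name for name, results in rollouts.items()
        if keep_by_difficulty(pass_rate(results))
    ]
    print(kept)
```

Under group-normalized advantages, a problem whose rollouts all receive the same reward contributes no learning signal, which is one common motivation for this kind of filter.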
## 📊 Main Results
Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:
<div align="center"> <img src="./assets/perf.png" width="100%"/>
<sub>Average Performance on multimodal math and general multimodal benchmarks.</sub> </div>
<div align="center"> <img src="./assets/bench_results.png" width="600"/>
<sub>Specific Performance on multimodal math and general multimodal benchmarks.</sub> </div>
## DeepVision-103K Training & Evaluation Toolkit
We use [GSPO](https://arxiv.org/abs/2507.18071) for training and [vllm](https://github.com/vllm-project/vllm) for async batch evaluation. The training code is built on top of [verl](https://github.com/volcengine/verl). We use [swanlab](https://github.com/SwanHubX/SwanLab) for experiment tracking.
### Installation
#### Recommended Environment
We recommend the following environment configuration:
- CUDA 12.8
- PyTorch 2.8.0
- vLLM 0.11.0
- Transformers 4.57.1
#### Setup Steps
```bash
# Clone the repo
git clone https://github.com/SKYLENAGE-AI/DeepVision-103K && cd DeepVision-103K
# Install Math-Verify (PyPI package name: math-verify) for rule-based verification
pip install math-verify
# Install qwen_vl_utils for model training
pip install qwen_vl_utils
# Install verl in editable mode
pip install -e .
```
---
### Training
Two training templates are provided under `train_scripts/`. Both use the GSPO algorithm with GRPO advantage estimation.
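For readers unfamiliar with the two ingredients, the following is a minimal pure-Python sketch of group-normalized (GRPO-style) advantages and the sequence-level, length-normalized importance ratio with clipping described in the GSPO paper. It is an illustration under simplifying assumptions (scalar rewards per response, an arbitrarily chosen symmetric clip range `eps`), not the verl implementation used by the training scripts.

```python
import math


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_ratio(new_logps: list[float], old_logps: list[float]) -> float:
    """GSPO-style ratio: exp of the mean per-token log-probability difference."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Pessimistic clipping applied to the sequence-level ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # one group of four responses to one prompt
    print(group_advantages(rewards))
    print(sequence_ratio([-0.9, -1.8], [-1.0, -2.0]))
```

The key difference from token-level clipping is that the ratio is computed once per response, so every token in a response shares the same clipping decision.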
#### Quick Start
1. **Search for `{YOUR_`** in the script to find all placeholders that need to be filled in:
| Placeholder | Description |
|---|---|
| `{YOUR_SWANLAB_API_KEY}` | Your SwanLab API key (for experiment tracking) |
| `{YOUR_PROJECT_NAME}` | Project name for experiment grouping |
| `{YOUR_BASE_MODEL}` | Base model identifier (used in experiment naming) |
| `{YOUR_ROOT_PATH}` | Root directory for saving checkpoints |
| `{YOUR_MODEL_PATH}` | Path to the pretrained model (e.g. HuggingFace format) |
| `{YOUR_TRAIN_FILE}` | Path to training data (`.parquet` format) |
| `{YOUR_TEST_FILE}` | Path to validation data (`.parquet` format) |
2. **Uncomment the GPU setting block** that matches your cluster size (8 / 16 / 32 / 64 GPUs).
3. **Run the script.**
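Before launching, it can help to confirm that no placeholder was missed. A small sketch, assuming placeholders appear literally in the form `{YOUR_...}` as in the table above; the script path passed on the command line is whichever template you edited.

```python
import re
import sys
from pathlib import Path

# Matches tokens such as {YOUR_MODEL_PATH} or {YOUR_SWANLAB_API_KEY}.
PLACEHOLDER = re.compile(r"\{YOUR_[A-Z0-9_]+\}")


def find_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every unfilled placeholder."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    script = Path(sys.argv[1])  # e.g. train_scripts/train_single_node_template.sh
    remaining = find_placeholders(script.read_text(encoding="utf-8"))
    for lineno, name in remaining:
        print(f"{script}:{lineno}: {name} is still unfilled")
    sys.exit(1 if remaining else 0)
```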
#### Single-Node Training (8/16 GPUs on one machine)
```bash
bash train_scripts/train_single_node_template.sh
```
#### Multi-Node Training (Ray cluster across multiple machines)
```bash
# Submit to each node via your job scheduler
# Environment variables RANK, WORLD_SIZE, MASTER_ADDR must be set by the scheduler
bash train_scripts/train_multi_node_template.sh
```
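A quick pre-flight check for the scheduler-provided variables might look like the sketch below. It only verifies that `RANK`, `WORLD_SIZE`, and `MASTER_ADDR` are set and non-empty; the function name is illustrative, and the template script may validate more than this.

```python
import os

REQUIRED = ("RANK", "WORLD_SIZE", "MASTER_ADDR")


def missing_env(env=None) -> list[str]:
    """Names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit(f"missing environment variables: {', '.join(missing)}")
    print("scheduler environment looks complete")
```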
### Evaluation
The evaluation pipeline under `eval_scripts/` provides inference and evaluation scripts.
#### Inference
1. **Fill in placeholders** in `caller.sh`:
```bash
python caller_async.py \
--model /path/to/your/model \
--input /path/to/input.jsonl \
--output /path/to/output.jsonl \
--hyperparam mimo \
--prompt-field prompt \
--gpu-devices "0,1,2,3,4,5,6,7" \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--concurrent-per-endpoint 16 \
--max-tokens 16384 \
--n 8
```
2. **Run:**
```bash
cd eval_scripts
bash caller.sh
```
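The `--prompt-field prompt` flag suggests that each line of the input JSONL is a JSON object with a `prompt` key. The sketch below writes such a file and checks that tensor-parallel size times data-parallel size equals the number of listed GPUs. Both the record layout and the GPU-count rule are assumptions about `caller_async.py`, not documented behavior, and how images are referenced in each record is not specified here, so only the text field is covered.

```python
import json
import tempfile
from pathlib import Path


def write_prompts(prompts: list[str], path: Path, field: str = "prompt") -> None:
    """Write one JSON object per line, storing each prompt under `field`."""
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({field: prompt}, ensure_ascii=False) + "\n")


def gpus_match(gpu_devices: str, tensor_parallel: int, data_parallel: int) -> bool:
    """True if tensor_parallel * data_parallel equals the number of listed GPUs."""
    n_gpus = len([d for d in gpu_devices.split(",") if d.strip()])
    return tensor_parallel * data_parallel == n_gpus


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp()) / "input.jsonl"
    write_prompts(["What is the area of a unit square?"], out)
    print(out.read_text(encoding="utf-8"), end="")
    # Values taken from the caller.sh example above.
    print(gpus_match("0,1,2,3,4,5,6,7", tensor_parallel=1, data_parallel=8))
```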
### Post-Inference Evaluation
After inference is complete, use the evaluation tools under `eval_scripts/evaluation/` to score and analyze results.
#### Step 1: Math-Verify Rule-Based Evaluation
Run the math-verify judge to compute accuracy and automatically export error cases:
```bash
python eval_scripts/evaluation/mathverify_judge.py -i /path/to/your_output.jsonl
```
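The control flow of this step can be pictured as: score each record, report accuracy, and write the failures to a sibling file. The sketch below is not `mathverify_judge.py`; the `response` and `answer` keys, the exact-match comparison, and the output naming (input stem plus `_mathverify_error.jsonl`) are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def is_correct(response: str, answer: str) -> bool:
    """Placeholder check; a real judge would use a math-aware comparison."""
    return response.strip() == answer.strip()


def judge(input_path: Path) -> float:
    """Return accuracy and write failing records next to the input file."""
    text = input_path.read_text(encoding="utf-8")
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    errors = [r for r in records if not is_correct(r["response"], r["answer"])]
    error_path = input_path.with_name(input_path.stem + "_mathverify_error.jsonl")
    error_path.write_text(
        "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in errors),
        encoding="utf-8",
    )
    return 1 - len(errors) / len(records) if records else 0.0


if __name__ == "__main__":
    demo = Path(tempfile.mkdtemp()) / "demo_output.jsonl"
    demo.write_text(
        '{"response": "4", "answer": "4"}\n{"response": "5", "answer": "6"}\n',
        encoding="utf-8",
    )
    print(f"accuracy: {judge(demo):.2f}")
```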
#### Step 2: GPT-5-mini Re-Judge on Error Cases
For the exported error cases (`*_mathverify_error.jsonl`), use GPT-5-mini as a secondary judge to catch false negatives from rule-based matching.
The judge prompt template is defined in `eval_scripts/evaluation/gpt5-mini-judge_prompt.md`.
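Combining the two stages is simple arithmetic: a response counts as correct if the rule-based check accepts it or the secondary judge overturns the rule-based rejection. A minimal sketch with made-up counts; the function name is illustrative and no API call is made.

```python
def final_accuracy(total: int, rule_correct: int, rejudged_correct: int) -> float:
    """Accuracy after adding error cases that the secondary judge accepts."""
    n_errors = total - rule_correct
    if not 0 <= rejudged_correct <= n_errors:
        raise ValueError("re-judged correct count must lie within the error cases")
    return (rule_correct + rejudged_correct) / total


if __name__ == "__main__":
    # Hypothetical numbers: 1000 responses, 700 pass the rule-based check,
    # and the secondary judge accepts 25 of the 300 exported error cases.
    print(f"{final_accuracy(1000, 700, 25):.3f}")
```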
## 📖 Citation
```bibtex
@misc{sun2026deepvision103kvisuallydiversebroadcoverage,
title={DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning},
author={Haoxiang Sun and Lizhen Xu and Bing Zhao and Wotao Yin and Wei Wang and Boyu Yang and Rui Wang and Hu Wei},
year={2026},
eprint={2602.16742},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.16742},
}
```
## 🙏 Acknowledgements
This work builds upon the following resources:
- **[MM-MathInstruct-3M](https://huggingface.co/datasets/MathLLMs/MM-MathInstruct)**: Large-scale multimodal math instruction data from real educational contexts.
- **[MultiMath-300K](https://huggingface.co/datasets/pengshuai-rin/multimath-300k)**: Multimodal mathematical dataset from real educational contexts.
- **[Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: Visual logic reasoning problems.
- **[GameQA](https://huggingface.co/datasets/OpenMOSS-Team/GameQA-140K)**: Game-based visual reasoning tasks.
Provider: JamesGoGo



