
JamesGoGo/DeepVision-103K

Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JamesGoGo/DeepVision-103K
Resource overview:
---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- image-text-to-text
pretty_name: DeepVision-103K
tags:
- math
- multimodal
- reasoning
- rl
configs:
- config_name: visual_logic
  data_files:
  - split: train
    path: visual_logic-26k.parquet
- config_name: math
  data_files:
  - split: train
    path: math-77k.parquet
---

<div align="center">

# 🔭 DeepVision-103K

<div>
A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning
</div>
</div>

<div>
<br>

<div align="center">

[![Data](https://img.shields.io/badge/Data-4d5eff?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/skylenage/DeepVision-103K)
[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/SKYLENAGE-AI/DeepVision-103K)
[![Paper](https://img.shields.io/badge/Paper-2602.16742-b31b1b.svg?style=for-the-badge)](https://huggingface.co/papers/2602.16742)

</div>
</div>

Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:

<div align="center"> <img src="./assets/perf.png" width="100%"/> <sub>Average Performance on multimodal math and general multimodal benchmarks.</sub> </div>

Training on DeepVision-103K elicits more efficient reasoning.

| Benchmark | Qwen3-VL-8B-Instruct (Acc / Tokens) | Qwen3-VL-8B-DeepVision (Acc / Tokens) | Qwen3-VL-8B-Thinking (Acc / Tokens) |
| ----------- | ----------------------------------- | ------------------------------------- | ----------------------------------- |
| WeMath | 79.36 / 1428 | 85.11 / 2010 | 84.54 / 3754 |
| MathVision | 51.44 / 4288 | 55.49 / 5738 | 57.89 / 8970 |
| MathVerse | 67.38 / 1572 | 72.46 / 2714 | 72.84 / 4665 |
| LogicVista | 61.16 / 1769 | 64.73 / 2716 | 64.73 / 6115 |
| MMMU_val | 67.66 / 2099 | 71.33 / 2758 | 69.33 / 5082 |
| MMMU_Pro | 67.69 / 2170 | 70.29 / 2895 | 70.29 / 5037 |
| M³CoT | 70.83 / 1029 | 71.61 / 1294 | 71.31 / 2761 |
| **Average** | 66.50 / 2333 | **70.15 / 3173** | 70.13 / 4995 |

## 📢 News

- **Feb 16, 2026**: We release **`DeepVision-103K`**, a large-scale, visually diverse, and verifiable multimodal mathematical dataset for advancing multimodal reasoning via RLVR.

## 📦 Resource

- 🧩 Training data: [`DeepVision-103K`](https://huggingface.co/datasets/skylenage/DeepVision-103K)
- 💻 Code: [`DeepVision-103K`](https://github.com/SKYLENAGE-AI/DeepVision-103K)
- 📄 Paper: [DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning](https://huggingface.co/papers/2602.16742)

## 📝 Overview

**`DeepVision-103K`** is a dataset designed for LMM Reasoning, curated from diverse real-world K12 educational sources. Key features include:

**1. Visual Diversity**: DeepVision-103K covers planar geometry, solid geometry, analytic plots, data charts, schematic diagrams, and real-world items in mathematical contexts.

<div align="center"> <img src="./assets/visual_elements.png" width="100%"/> <sub>Visual elements in DeepVision-103K</sub> </div>

Within each category, DeepVision offers richer element types than existing open-source datasets.

<div align="center"> <img src="./assets/ve3.png" width="100%"/> <sub>The number of different visual element types across training datasets.</sub> </div>

**2. Broad Coverage**: DeepVision-103K spans Geometry, Algebra, Probability & Statistics, and Fundamental Mathematical Skills.

<div align="center"> <img src="./assets/domain.png" width="400"/> <sub>Hierarchical breakdown of mathematical topics covered in DeepVision-103K.</sub> </div>

**3. Rich Data Format**: Each sample contains structured annotations to support various downstream tasks:

<div align="center"> <img src="./assets/overview.png" width="600"/> <sub>A data sample from DeepVision-103K.</sub> </div>

- **Question & Image**: Problem statement and corresponding image.
- **Final Answer**: A unique, verifiable answer enabling rule-based reward computation in RLVR.
- **Pass Rate**: The proportion of correct responses obtained during model rollouts.
- **Topic**: Hierarchical classification of the mathematical branch.
- **Knowledge Points**: Specific mathematical concepts, theorems, or techniques required.
- **Visual Elements**: Geometric or graphical objects depicted in the image.

## Curation Pipeline

A three-stage pipeline transforms diverse but noisy real-world K12 problems into structured and verifiable QA pairs:

- **Validity Filtering**: Remove problems unsuitable for RL (proof-based, descriptive, multi-answer questions).
- **Difficulty Filtering**: Calibrate sample difficulty via model rollout pass rates.
- **Query Correctness Verification**: Validate image-question pairs and answers using Gemini-3-Flash.
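
To make the difficulty-filtering stage concrete, here is a minimal sketch of computing a pass rate from rollouts and keeping only samples that are neither always solved nor never solved. It is an illustration under stated assumptions, not the authors' implementation: the `Sample` structure, the `exact_match` check, and the `low`/`high` thresholds are all hypothetical, and the card does not say which rollout model, rollout count, or pass-rate band was actually used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical record; the field names are assumptions, not the dataset schema."""
    question: str
    final_answer: str
    rollouts: List[str]  # final answers extracted from model rollouts


def exact_match(prediction: str, reference: str) -> bool:
    """Toy rule-based check (assumption): case- and whitespace-insensitive equality."""
    return prediction.strip().lower() == reference.strip().lower()


def pass_rate(sample: Sample,
              is_correct: Callable[[str, str], bool] = exact_match) -> float:
    """Proportion of rollouts whose answer matches the reference answer."""
    if not sample.rollouts:
        raise ValueError("pass rate is undefined without rollouts")
    hits = sum(is_correct(r, sample.final_answer) for r in sample.rollouts)
    return hits / len(sample.rollouts)


def filter_by_difficulty(samples: List[Sample],
                         low: float = 0.0,
                         high: float = 1.0) -> List[Sample]:
    """Keep samples with low < pass_rate < high (thresholds are illustrative)."""
    return [s for s in samples if low < pass_rate(s) < high]


samples = [
    Sample("q1", "12", ["12"] * 8),                     # always solved
    Sample("q2", "3/4", ["3/4", "0.5", "3/4", "1",
                         "2", "3/4", "7", "3/4"]),      # solved half the time
    Sample("q3", "90", ["45"] * 8),                     # never solved
]
kept = filter_by_difficulty(samples)
print([s.question for s in kept])
```

With the default band, samples that every rollout solves or that no rollout solves are dropped, since neither gives a useful learning signal under a rule-based reward. A real pipeline would swap `exact_match` for a symbolic verifier.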

<div align="center"> <img src="./assets/pipeline.png" width="600"/> <sub>Curation pipeline for mathematical data in DeepVision-103K.</sub> </div>

## 📊 Main Results

Training on DeepVision-103K yields **top performance** on both multimodal mathematical reasoning and general multimodal benchmarks:

<div align="center"> <img src="./assets/perf.png" width="100%"/> <sub>Average Performance on multimodal math and general multimodal benchmarks.</sub> </div>

<div align="center"> <img src="./assets/bench_results.png" width="600"/> <sub>Specific Performance on multimodal math and general multimodal benchmarks.</sub> </div>

## DeepVision-103K Training & Evaluation Toolkit

We use [GSPO](https://arxiv.org/abs/2507.18071) for training and [vllm](https://github.com/vllm-project/vllm) for async batch evaluation. The training code is built on top of [verl](https://github.com/volcengine/verl). We use [swanlab](https://github.com/SwanHubX/SwanLab) for experiment tracking.

### Installation

#### Recommended Environment

We recommend the following environment configuration:

- CUDA 12.8
- PyTorch 2.8.0
- vLLM 0.11.0
- Transformers 4.57.1

#### Setup Steps

```bash
# Clone the repo
git clone https://github.com/SKYLENAGE-AI/DeepVision-103K && cd DeepVision-103K

# Install mathverify for rule-based verification
pip install mathverify

# Install qwen_vl_utils for model training
pip install qwen_vl_utils

# Install verl in editable mode
pip install -e .
```

---

### Training

Two training templates are provided under `train_scripts/`. Both use the GSPO algorithm with GRPO advantage estimation.

#### Quick Start

1. **Search for `{YOUR_`** in the script to find all placeholders that need to be filled in:

   | Placeholder | Description |
   |---|---|
   | `{YOUR_SWANLAB_API_KEY}` | Your SwanLab API key (for experiment tracking) |
   | `{YOUR_PROJECT_NAME}` | Project name for experiment grouping |
   | `{YOUR_BASE_MODEL}` | Base model identifier (used in experiment naming) |
   | `{YOUR_ROOT_PATH}` | Root directory for saving checkpoints |
   | `{YOUR_MODEL_PATH}` | Path to the pretrained model (e.g. HuggingFace format) |
   | `{YOUR_TRAIN_FILE}` | Path to training data (`.parquet` format) |
   | `{YOUR_TEST_FILE}` | Path to validation data (`.parquet` format) |

2. **Uncomment the GPU setting block** that matches your cluster size (8 / 16 / 32 / 64 GPUs).

3. **Run the script.**

#### Single-Node Training (8/16 GPUs on one machine)

```bash
bash train_scripts/train_single_node_template.sh
```

#### Multi-Node Training (Ray cluster across multiple machines)

```bash
# Submit to each node via your job scheduler
# Environment variables RANK, WORLD_SIZE, MASTER_ADDR must be set by the scheduler
bash train_scripts/train_multi_node_template.sh
```

### Evaluation

The evaluation pipeline under `eval_scripts/` provides inference and evaluation scripts.

#### Inference

1. **Fill in placeholders** in `caller.sh`:

   ```bash
   python caller_async.py \
     --model /path/to/your/model \
     --input /path/to/input.jsonl \
     --output /path/to/output.jsonl \
     --hyperparam mimo \
     --prompt-field prompt \
     --gpu-devices "0,1,2,3,4,5,6,7" \
     --tensor-parallel-size 1 \
     --data-parallel-size 8 \
     --concurrent-per-endpoint 16 \
     --max-tokens 16384 \
     --n 8
   ```

2. **Run:**

   ```bash
   cd eval_scripts
   bash caller.sh
   ```

### Post-Inference Evaluation

After inference is complete, use the evaluation tools under `eval_scripts/evaluation/` to score and analyze results.

#### Step 1: Math-Verify Rule-Based Evaluation

Run the math-verify judge to compute accuracy and automatically export error cases:

```bash
python eval_scripts/evaluation/mathverify_judge.py -i /path/to/your_output.jsonl
```

#### Step 2: GPT-5-mini Re-Judge on Error Cases

For the exported error cases (`*_mathverify_error.jsonl`), use GPT-5-mini as a secondary judge to catch false negatives from rule-based matching.

The judge prompt template is defined in `eval_scripts/evaluation/gpt5-mini-judge_prompt.md`.

## 📖 Citation

```bibtex
@misc{sun2026deepvision103kvisuallydiversebroadcoverage,
  title={DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning},
  author={Haoxiang Sun and Lizhen Xu and Bing Zhao and Wotao Yin and Wei Wang and Boyu Yang and Rui Wang and Hu Wei},
  year={2026},
  eprint={2602.16742},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.16742},
}
```

## 🙏 Acknowledgements

This work builds upon the following resources:

- **[MM-MathInstruct-3M](https://huggingface.co/datasets/MathLLMs/MM-MathInstruct)**: Large-scale multimodal math instruction data from real educational contexts.
- **[MultiMath-300K](https://huggingface.co/datasets/pengshuai-rin/multimath-300k)**: Multimodal mathematical dataset from real educational contexts.
- **[Zebra-CoT](https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT)**: Visual logic reasoning problems.
- **[GameQA](https://huggingface.co/datasets/OpenMOSS-Team/GameQA-140K)**: Game-based visual reasoning tasks.

Provider:
JamesGoGo