VisualWebInstruct_Verified
Source: ModelScope community (updated 2026-01-06, indexed 2025-11-29)
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VisualWebInstruct_Verified
# 🧠 VisualWebInstruct-Verified: High-Confidence Multimodal QA for Reinforcement Learning
**VisualWebInstruct-Verified** is a high-confidence subset of [VisualWebInstruct](https://huggingface.co/TIGER-AI-Lab/VisualWebInstruct), curated specifically for **Reinforcement Learning (RL)** and **Reward Model** training.
It contains *verified* multimodal question–answer pairs where correctness, reasoning quality, and image–text alignment have been explicitly validated.
This dataset is intended for **RLVR** (reinforcement learning with verifiable rewards) training pipelines.
---
## 📘 Dataset Overview
| Property | Description |
| ----------------------- | --------------------------------------------------------- |
| **Total Samples** | ≈ *973 K* |
| **Image Coverage** | 100% (every sample has at least one image) |
| **Primary Use** | RLVR training of vision-language models (VLMs) |
---
## 🌐 Background
While **VisualWebInstruct** introduced large-scale multimodal reasoning data across mathematics, physics, engineering, chemistry, and finance,
**VisualWebInstruct-Verified** focuses on **data quality and trustworthiness**, providing reliable supervision signals for reinforcement learning.
Compared with standard instruction-tuning datasets, this subset is optimized for:
* Stable RL optimization with verified feedback
* Evaluation and curriculum-style fine-tuning by difficulty level
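
As a rough illustration of curriculum-style use, the sketch below orders and groups records by the `difficulty` field (an integer from 1 to 5). It is a minimal sketch, not part of the dataset tooling: the helper names and toy records are hypothetical, and it assumes that a lower `difficulty` value means an easier problem, which this card does not state.

```python
from collections import defaultdict


def curriculum_order(samples, ascending=True):
    """Return samples sorted by `difficulty`.

    sorted() is stable, so samples with the same difficulty keep
    their original relative order.
    """
    return sorted(samples, key=lambda s: s["difficulty"], reverse=not ascending)


def bucket_by_difficulty(samples):
    """Group samples into {difficulty level: [samples]}."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample["difficulty"]].append(sample)
    return dict(buckets)


# Hypothetical toy records; real entries also carry images, question, etc.
toy = [
    {"idx": "a", "difficulty": 3},
    {"idx": "b", "difficulty": 1},
    {"idx": "c", "difficulty": 5},
    {"idx": "d", "difficulty": 1},
]

ordered = curriculum_order(toy)          # assumed easy-to-hard schedule
buckets = bucket_by_difficulty(toy)      # per-level evaluation splits
print([s["idx"] for s in ordered])
print({level: len(items) for level, items in buckets.items()})
```

The buckets can double as per-level evaluation splits, while the sorted list gives one possible training schedule.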
---
## ⚙️ Data Structure
Each entry in the dataset has the following fields:
| Field | Type | Description |
| -------------- | -------------------- | -------------------------------------------- |
| `images` | `list[str]` | Path(s) or URL(s) to associated images |
| `idx` | `string` | Unique identifier for the QA pair |
| `question` | `string` | Problem statement or instruction |
| `answer` | `string` | Full reasoning and final answer |
| `short_answer` | `string` | Concise final answer (used as target signal) |
| `difficulty` | `int` | Difficulty level from 1 to 5 |
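
Since `short_answer` serves as the target signal, one simple way to turn it into a verifiable reward is normalized exact match. The sketch below is an assumption about how such a reward could look, not the reward used by the authors: the `Answer:` marker convention, the normalization rules, and the function names are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".!;, ")


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed convention).

    Falls back to the last non-empty line when no marker is present.
    """
    matches = list(re.finditer(r"answer\s*:", response, flags=re.IGNORECASE))
    if matches:
        return response[matches[-1].end():]
    lines = [line for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def exact_match_reward(response: str, record: dict) -> float:
    """Binary reward: 1.0 if the extracted answer matches `short_answer`."""
    predicted = normalize(extract_final_answer(response))
    target = normalize(record["short_answer"])
    return 1.0 if predicted == target else 0.0


# Hypothetical record containing only the field the reward needs.
record = {"short_answer": "12 cm"}
print(exact_match_reward("The sides sum to twelve.\nAnswer: 12 cm.", record))
print(exact_match_reward("Answer: 13 cm", record))
```

Exact match is strict: equivalent forms such as `0.5` and `1/2` would score 0, so a real pipeline would likely need a more tolerant checker for mathematical answers.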
---
## 🧩 Verification Pipeline
**VisualWebInstruct-Verified** was derived from the **image-containing portion** of the original *VisualWebInstruct* dataset and refined through an automated verification process powered by **Gemini 2.5 Pro**:
1. **Image-Based Filtering** — Only samples associated with one or more images were retained from the original dataset.
2. **Question Decomposition** — When a single sample contained multiple sub-questions, it was split into separate atomic QA pairs, each focusing on one distinct reasoning task.
3. **Gemini 2.5 Pro Verification** — Each QA pair was automatically verified by Gemini 2.5 Pro to ensure logical correctness, answer consistency, and image–text coherence.
4. **Final Consolidation** — All remaining verified pairs were reformatted and consolidated into the final *VisualWebInstruct-Verified* dataset.
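
The four steps above can be sketched as a small filter-split-verify loop. This is a simplified illustration, not the authors' implementation: the `sub_questions` field, the ID suffix scheme, and the `verifier` callable (a stand-in for a call to Gemini 2.5 Pro) are hypothetical.

```python
from typing import Callable, Iterable


def has_images(sample: dict) -> bool:
    """Step 1: keep only samples with one or more images."""
    return bool(sample.get("images"))


def decompose(sample: dict) -> list:
    """Step 2: split a multi-part sample into atomic QA pairs.

    Assumes a hypothetical `sub_questions` list; samples without it
    are treated as already atomic.
    """
    subs = sample.get("sub_questions")
    if not subs:
        return [sample]
    return [
        {
            "idx": f"{sample['idx']}_{i}",
            "images": sample["images"],
            "question": sub["question"],
            "answer": sub["answer"],
        }
        for i, sub in enumerate(subs)
    ]


def run_pipeline(samples: Iterable, verifier: Callable) -> list:
    """Steps 1-4: filter, decompose, verify, and collect the survivors."""
    verified = []
    for sample in samples:
        if not has_images(sample):
            continue
        for qa in decompose(sample):
            if verifier(qa):  # Step 3: a model-based check in the real pipeline
                verified.append(qa)
    return verified  # Step 4: consolidated list of verified pairs


# Toy verifier for demonstration only: accepts any non-empty answer.
def toy_verifier(qa: dict) -> bool:
    return bool(qa["answer"].strip())


raw = [
    {"idx": "s1", "images": [], "question": "q1", "answer": "a1"},
    {"idx": "s2", "images": ["img2.png"], "sub_questions": [
        {"question": "(a)", "answer": "4"},
        {"question": "(b)", "answer": ""},
    ]},
    {"idx": "s3", "images": ["img3.png"], "question": "q3", "answer": "7"},
]
print([qa["idx"] for qa in run_pipeline(raw, toy_verifier)])
```

Passing the verifier in as a callable keeps the loop independent of any particular model API, so the same skeleton works with a rule-based check or a model-based one.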
---
## 📊 Relationship to VisualWebInstruct
| Dataset | Purpose | Scale | Verification | RL Ready |
| ------------------------------ | ------------------------------------------------- | ------ | ------------ | -------- |
| **VisualWebInstruct** | General multimodal instruction tuning | 906 K | ✗ | ✗ |
| **VisualWebInstruct-Verified** | High-confidence subset for RLVR | 973 K | ✓ | ✓ |
---
## 📈 Performance Impact
Reinforcement learning on **VisualWebInstruct-Verified** improves multimodal reasoning, most clearly on general-domain benchmarks.
After 200 RL steps on this verified dataset, **MiMo-VL-7B-SFT + 200 steps** outscores the **MiMo-VL-7B-SFT-0805** row on MMMU-Pro (Standard), MMMU, and MMMU-Pro (Vision), and posts the best MMMU-Pro results in the table. **MiMo-VL-7B-RL** remains the strongest listed model on MMMU and MathVista.
| Model | MMMU-Pro (Standard 10 Options) | MMMU | MMMU-Pro (Vision) | MathVista |
| :------------------------------------ | :----------------------------: | :------: | :---------------: | :-------: |
| **MiMo-VL-7B-RL** | 46.2 | **66.7** | 40.3 | **81.5** |
| **MiMo-VL-7B-SFT-0805** | 47.2 | 58.4 | 45.1 | 78.2 |
| **InternVL3-8B** | 45.6 | 62.7 | 37.8 | 71.6 |
| **Qwen2.5-VL-7B-Instruct** | 34.7 | 58.6 | 29.4 | 61.2 |
| **MiMo-VL-7B-SFT + 200 steps (Ours)** | **51.9** | 63.7 | **47.7** | 76.4 |
### Key Observations
* **Reinforcement learning with verified data** yields gains of about **+3 to +5 absolute points** over the SFT row on the MMMU-family benchmarks: +4.7 on MMMU-Pro (Standard), +5.3 on MMMU, and +2.6 on MMMU-Pro (Vision).
* The improvements are concentrated on **general-domain reasoning benchmarks** such as **MMMU** and **MMMU-Pro**. On the math-specific **MathVista**, the score drops from 78.2 to 76.4, which suggests the dataset mainly strengthens broad reasoning rather than purely numerical tasks.
* The **MiMo-VL-7B-SFT + 200 steps** model trained on *VisualWebInstruct-Verified* shows stronger reasoning stability, more accurate factual grounding, and better visual-textual coherence compared to its SFT baseline.
* These findings highlight the importance of **high-confidence multimodal data** for achieving robust general reasoning through RL-based instruction optimization.
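
The per-benchmark differences can be recomputed directly from the table. The sketch below uses only the numbers reported above, and it assumes that the **MiMo-VL-7B-SFT-0805** row is the SFT baseline for the **MiMo-VL-7B-SFT + 200 steps** row.

```python
# Scores copied from the table above, in column order.
BENCHMARKS = ["MMMU-Pro (Standard)", "MMMU", "MMMU-Pro (Vision)", "MathVista"]
sft_baseline = [47.2, 58.4, 45.1, 78.2]   # MiMo-VL-7B-SFT-0805 (assumed baseline)
rl_200_steps = [51.9, 63.7, 47.7, 76.4]   # MiMo-VL-7B-SFT + 200 steps

# Absolute change per benchmark, rounded to one decimal like the table.
deltas = {
    name: round(after - before, 1)
    for name, before, after in zip(BENCHMARKS, sft_baseline, rl_200_steps)
}

# Mean change over all four benchmarks, and over the three MMMU-family ones.
mean_all = sum(deltas.values()) / len(deltas)
general = [d for name, d in deltas.items() if name != "MathVista"]
mean_general = sum(general) / len(general)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print(f"mean over all four: {mean_all:+.2f}")
print(f"mean over MMMU-family: {mean_general:+.2f}")
```

Reporting both means makes it explicit that the average gain depends on whether MathVista, where the score decreases, is included.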
---
## Citation
If you use VisualWebInstruct in your research, please cite our paper:
```bibtex
@article{visualwebinstruct,
title = {VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},
author = {Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},
journal = {arXiv preprint arXiv:2503.10582},
year = {2025}
}
```
## Acknowledgements
This research was supported by NetMind.Ai, which provided cloud compute, and by Google DeepMind, which generously provided Gemini credits.
Provider: maas
Created: 2025-10-25



