Download link: https://modelscope.cn/datasets/BytedanceDouyinContent/VGR
# VGR-SFT: Dataset for Visual Grounded Reasoning
- [Arxiv Paper Link](https://arxiv.org/pdf/2506.11991)
- [Data Repository](https://huggingface.co/datasets/BytedanceDouyinContent/VGR)
## Dataset Overview
VGR-SFT (Visual Grounded Reasoning - Supervised Fine-Tuning) is a large-scale multimodal reasoning dataset associated with the paper "VGR: Visual Grounded Reasoning". This dataset marks the first attempt to explicitly model visual region attention in multimodal reasoning, containing reasoning data with mixed vision grounding and language deduction. It enables models to autonomously attend to arbitrary visual regions during the reasoning process.
## Key Features
- **Joint Visual-Language Reasoning**: Each sample includes an image, question, reasoning chain, and answer, with annotations of visual regions relevant to the reasoning.
- **Autonomous Region Attention**: Grounding regions in the dataset are generated autonomously by models, avoiding manual annotation bias.
- **Diverse Domain Coverage**: Includes various task types such as science question answering, chart understanding, and document visual question answering.
- **Efficient Feature Utilization**: Reduces visual token consumption by 70% compared to baselines through a selective feature replay mechanism.
## Dataset Structure
### Data Composition
| Subdataset | Size | Task Type |
|--------------|--------|-----------------|
| AI2D | 12.5k | Science QA |
| LLaVA-COCO | 12.3k | General VQA |
| GQA | 39.2k | General VQA |
| ChartQA | 11.2k | OCR |
| DVQA | 25.2k | OCR |
| DocVQA | 6.0k | OCR |
| OCRVQA | 51.6k | OCR |
| **Total** | **158.1k**| - |
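
As a quick sanity check on the composition table, the sketch below sums the per-subset sizes and groups them by task type. The numbers are copied from the table; the dictionary layout and variable names are illustrative only. The rounded entries add up to 158.0k, so the listed total of 158.1k presumably reflects rounding of the exact counts.

```python
from collections import defaultdict

# Sizes in thousands of samples, copied from the composition table.
SUBSETS = {
    "AI2D": (12.5, "Science QA"),
    "LLaVA-COCO": (12.3, "General VQA"),
    "GQA": (39.2, "General VQA"),
    "ChartQA": (11.2, "OCR"),
    "DVQA": (25.2, "OCR"),
    "DocVQA": (6.0, "OCR"),
    "OCRVQA": (51.6, "OCR"),
}

# Sum of the rounded table entries.
total_k = sum(size for size, _ in SUBSETS.values())

# Aggregate sizes per task type.
by_task = defaultdict(float)
for size, task in SUBSETS.values():
    by_task[task] += size

for task, size in sorted(by_task.items()):
    print(f"{task}: {size:.1f}k ({100 * size / total_k:.1f}% of listed subsets)")
print(f"Sum of rounded entries: {total_k:.1f}k")
```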

Due to copyright restrictions, we do not provide the image source files directly. You can download the required images from the official dataset provided by [LLaVA-NeXT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data). The `image` field in the VGR-SFT data gives each image's path relative to the `llava_next_raw_format` directory.
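
A minimal sketch of how a record's `image` field might be resolved against a local copy of the LLaVA-NeXT images. Only the `image` field name and the `llava_next_raw_format` directory come from this card; the helper name, the sample record, and its path are hypothetical.

```python
from pathlib import Path


def resolve_image_path(record: dict, image_root: str) -> Path:
    """Join the record's relative 'image' field onto the local image root."""
    rel = Path(record["image"])
    # Reject absolute paths or parent-directory escapes, which would not be
    # valid relative paths inside the image root.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unexpected image path: {rel}")
    return Path(image_root) / rel


# Hypothetical record: the path below is made up for illustration.
sample = {"image": "coco/train2017/000000000009.jpg"}
path = resolve_image_path(sample, "llava_next_raw_format")
print(path.as_posix())
```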
We currently release a 50k preview version of the dataset; the full data will be released later. The data includes a short version and a long version. The short version is rewritten from the long version to reduce training difficulty.
## Data Construction Pipeline
1. **Cold-Start Data Generation**: Initial reasoning data with visual region annotations generated using the Qwen2.5-VL-72B model.
2. **Rejection Sampling Refinement**:
   - Format Verification: Ensures answer parseability and correct coordinate formatting.
   - Correctness Verification: Validates reasoning accuracy via ANLS and commercial model APIs.
   - Visual Grounding Verification: Crops regions and verifies content alignment with annotations.
3. **Data Scaling**: Trains an annotation model using InternVL3-14B, integrating Open-R1 text reasoning data to enhance generalization, and rewrites the training data with a commercial model.
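
The format and correctness checks in step 2 can be sketched roughly as follows. This is not the authors' implementation: the `[x1, y1, x2, y2]` box syntax, the function names, and the acceptance threshold are assumptions for illustration. ANLS here follows the commonly used definition (one minus the normalized Levenshtein distance, zeroed at or above a 0.5 threshold, maximized over reference answers). The commercial-API check and the crop-based grounding check are not modeled.

```python
import re

# Assumed box syntax for illustration; the real coordinate format may differ.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_boxes(text):
    """Return all boxes found in text, or None if any box is degenerate."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = map(int, match.groups())
        if x1 >= x2 or y1 >= y2:
            return None  # malformed region: reject the sample
        boxes.append((x1, y1, x2, y2))
    return boxes


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls_score(prediction, references, threshold=0.5):
    """Best thresholded similarity between the prediction and any reference."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        ref = ref.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best


def accept(reasoning, prediction, references, min_anls=0.5):
    """Keep a sample only if its boxes parse and its answer is close enough."""
    return parse_boxes(reasoning) is not None and anls_score(prediction, references) >= min_anls


print(accept("The label in [10, 20, 110, 220] reads Paris.", "Paris", ["paris"]))
```

A sample that passes these two filters would still need the grounding check (cropping each region and comparing its content with the annotation) before being kept.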
## Model Performance with VGR

## Data Example

## Citation
If you use this dataset, please cite the following paper:
```bibtex
@article{wang2025vgr,
title={VGR: Visual Grounded Reasoning},
author={Jiacong Wang and Zijian Kang and Haochen Wang and Haiyong Jiang and Jiawen Li and Bohong Wu and Ya Wang and Jiao Ran and Xiao Liang and Chao Feng and Jun Xiao},
journal={arXiv preprint arXiv:2506.11991},
year={2025}
}
```
## License
This dataset is released under the [Creative Commons Zero v1.0 Universal (CC-0)](https://creativecommons.org/publicdomain/zero/1.0/) license, subject to any intellectual property rights in the dataset owned by Bytedance. The data is adapted from the LLaVA-Next project, and your use of that data must comply with its respective licenses. Please see the [disclaimer](./Disclaimer.txt) for more details.
- Provider: maas
- Created: 2025-09-23