Tennet7/ZwZ-RL-VQA
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Tennet7/ZwZ-RL-VQA
Resource description:
---
language:
- en
license: apache-2.0
tags:
- multimodal
- vision-language-model
- fine-grained-perception
- vqa
- region-to-image-distillation
datasets:
- sa-1b
- laion
- visual-genome
- cc12m
size_categories:
- 10K<n<100K
---
# ZwZ-RL-VQA: Region-to-Image Distilled Training Data for Fine-Grained Perception
This dataset contains **74K high-quality VQA pairs** generated via **Region-to-Image Distillation (R2I)** for training multimodal large language models (MLLMs) on fine-grained perception tasks without test-time tool use.
## 📖 Overview
The **Zooming without Zooming (ZwZ)** method transforms "zooming" from an inference-time tool into a training-time primitive:
1. **Zoom-in Synthesis**: Strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate questions and answers on micro-cropped regions where fine details are unambiguous
2. **Zoom-out Distillation**: Region-grounded supervision is distilled back to full images with explicit bounding-box overlays
3. **Single-Pass Inference**: Trained models internalize zooming benefits, achieving fine-grained perception in one forward pass
## 📊 Dataset Statistics
| Attribute | Value |
|-----------|-------|
| **Total Samples** | 74,000 |
| **Source Images** | SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, STPLS3D |
| **Image Resolution** | Mostly > 1000×1000 (high-resolution) |
| **Crop Ratio** | Mostly < 10% of full image area (fine-grained focus) |
| **Question Types** | Counting, OCR, Color, Structure, Material, Identification |
| **Consensus Filter** | >6/8 agreement among teacher ensembles |
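
The crop ratio in the table is the crop's area divided by the full image's area. A minimal sketch of that arithmetic, assuming boxes are given as pixel `(x0, y0, x1, y1)` tuples; the box format and the helper name `crop_ratio` are illustrative assumptions, not the dataset's schema:

```python
def crop_ratio(box, image_size):
    """Return the fraction of the full image area covered by `box`.

    `box` is (x0, y0, x1, y1) in pixels and `image_size` is (width, height).
    Both conventions are assumptions made for this example.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    if width <= 0 or height <= 0:
        raise ValueError("image_size must be positive")
    # Clamp the box to the image so out-of-bounds coordinates do not inflate the area.
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    crop_area = max(0, x1 - x0) * max(0, y1 - y0)
    return crop_area / (width * height)


# A 300x200 crop inside a 2000x1500 image covers 2% of the image area.
ratio = crop_ratio((100, 100, 400, 300), (2000, 1500))
print(f"{ratio:.2%}")
```

A sample with a ratio like this one would fall within the "mostly < 10%" range reported above.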
## 🏗️ Data Generation Pipeline
### Teachers Used
| Role | Model |
|------|-------|
| **Question Generator** | Qwen3-VL-235B-A22B-Instruct |
| **Answer Generator 1** | Qwen3-VL-235B-A22B-Instruct |
| **Answer Generator 2** | GLM-4.5V |
### Quality Control
- ✅ **Consensus Filtering**: Retain only QA pairs with >75% teacher agreement (threshold: 6 of 8 votes)
- ✅ **Difficulty Filtering**: Reject samples that baseline Qwen3-VL-8B answers correctly >50% of the time
- ✅ **Visual Grounding**: Bounding boxes overlaid on images to resolve referential ambiguity
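
The consensus and difficulty filters listed above could be sketched as follows. This is an illustration only, not the authors' implementation: the function names, the exact-match vote counting, and the default thresholds (taken from the percentages stated above) are assumptions.

```python
from collections import Counter


def consensus_answer(teacher_answers, min_agreement=0.75):
    """Return the majority answer if its vote share exceeds `min_agreement`, else None.

    Votes are compared after simple normalization (strip + lowercase); a real
    pipeline would likely need a more careful answer matcher.
    """
    if not teacher_answers:
        return None
    votes = Counter(a.strip().lower() for a in teacher_answers)
    answer, count = votes.most_common(1)[0]
    return answer if count / len(teacher_answers) > min_agreement else None


def is_hard_enough(baseline_answers, reference, max_accuracy=0.5):
    """Keep a sample unless the baseline is correct more than `max_accuracy` of the time."""
    if not baseline_answers:
        return True
    target = reference.strip().lower()
    correct = sum(a.strip().lower() == target for a in baseline_answers)
    return correct / len(baseline_answers) <= max_accuracy


# Hypothetical example: 7 of 8 teacher votes agree, and the baseline is right 1 time in 4.
teachers = ["Red", "red", "red ", "red", "red", "red", "red", "blue"]
baseline = ["blue", "red", "green", "blue"]
answer = consensus_answer(teachers)
keep = answer is not None and is_hard_enough(baseline, answer)
print(answer, keep)
```

With the strict `>` comparison used here, exactly 6 of 8 votes (75%) would not pass. Whether the threshold is inclusive is not specified in this card, so adjust the comparison if needed.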
## 📂 Data Structure & Extraction
The image data is split into multiple compressed archive parts so that downloads are more reliable.
### 1. Extract Training Images
After downloading all `images.tar.gz.*` parts, use the following command to merge and extract them:
```bash
cd images/
# Merge the split parts in name order and extract into the current directory
# -z decompresses gzip explicitly, since tar is reading from a pipe here
cat images.tar.gz* | tar -xzvf - -C ./
```
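
For environments without `cat` and `tar` (for example, Windows), a rough Python equivalent is sketched below. It assumes the parts sort into the correct order by filename, which holds for typical `split` suffixes; the helper name `merge_and_extract` is hypothetical and not part of the repository.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def merge_and_extract(parts_dir, prefix="images.tar.gz", out_dir="."):
    """Concatenate split archive parts in name order, then extract the result."""
    parts = sorted(p for p in Path(parts_dir).iterdir()
                   if p.name.startswith(prefix + "."))
    if not parts:
        raise FileNotFoundError(f"no parts matching {prefix}.* in {parts_dir}")
    with tempfile.TemporaryFile() as merged:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, merged)  # append this part's bytes
        merged.seek(0)
        # "r:gz" reads a gzip-compressed tar from the merged file object.
        # Only extract archives you trust.
        with tarfile.open(fileobj=merged, mode="r:gz") as archive:
            archive.extractall(out_dir)
    return [p.name for p in parts]
```

A call such as `merge_and_extract("images", out_dir="images")` would mirror the shell command. Note that this sketch writes the merged archive to a temporary file first, so it needs free disk space roughly equal to the combined size of the parts.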
### 2. Original Data & Synthesis (Optional)
If you are interested in how the training data `images.tar.gz.*` was synthesized, you can refer to the [data synthesis script](https://github.com/inclusionAI/Zooming-without-Zooming/blob/main/data_synthesis/create_vqa.py).
The synthesis process uses the **original images**. To extract the source data, follow these steps:
```bash
cd original_images/
# Merge the split parts in name order and extract into the current directory
# -z decompresses gzip explicitly, since tar is reading from a pipe here
cat original_images.tar.gz* | tar -xzvf - -C ./
```
Once extracted, you can use the script mentioned above to reproduce the dataset from these original images.
## 🎯 Intended Use
This dataset is designed for:
- **Reinforcement Learning** on MLLMs (e.g., with DAPO/GRPO)
- **Research on distilling tool-use capabilities** into single-pass models
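
As a rough illustration of how VQA pairs like these can feed a GRPO-style update, the sketch below scores a group of sampled responses against the reference answer and normalizes the rewards within the group. The exact-match reward and the helper names are assumptions for illustration; the actual ZwZ training recipe may use a different reward and normalization.

```python
from statistics import mean, pstdev


def exact_match_reward(response, reference):
    """Binary reward: 1.0 if the normalized strings match, else 0.0 (an assumed reward)."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled responses to one question whose answer is "3".
responses = ["3", "4", "3", "2"]
rewards = [exact_match_reward(r, "3") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards, [round(a, 3) for a in advantages])
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, so samples the model always answers correctly add little in this setup.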
## 📈 Training Results
Models trained on this dataset (ZwZ-4B/7B/8B) achieve:
| Model | ZoomBench | HR-Bench-4K | HR-Bench-8K | VStar |
|-------|-----------|-------------|-------------|-------|
| ZwZ-4B | 55.74 | 81.75 | 79.50 | **92.67** |
| ZwZ-7B | 55.62 | 75.38 | 73.25 | 88.48 |
| ZwZ-8B | **58.11** | **84.38** | **82.00** | 91.10 |
*Qwen3-VL-8B baseline, in the same column order: 37.87 / 78.88 / 74.63 / 86.39*
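
The gains over the baseline can be read off by subtracting the baseline scores from each model row. A small sketch using only the numbers reported above; the variable names and layout are just for illustration:

```python
BENCHMARKS = ["ZoomBench", "HR-Bench-4K", "HR-Bench-8K", "VStar"]
BASELINE = [37.87, 78.88, 74.63, 86.39]  # Qwen3-VL-8B
SCORES = {
    "ZwZ-4B": [55.74, 81.75, 79.50, 92.67],
    "ZwZ-7B": [55.62, 75.38, 73.25, 88.48],
    "ZwZ-8B": [58.11, 84.38, 82.00, 91.10],
}


def deltas(scores, baseline):
    """Return per-benchmark differences (model minus baseline), rounded to 2 decimals."""
    return [round(s - b, 2) for s, b in zip(scores, baseline)]


for model, scores in SCORES.items():
    diffs = deltas(scores, BASELINE)
    row = ", ".join(f"{name} {d:+.2f}" for name, d in zip(BENCHMARKS, diffs))
    print(f"{model}: {row}")
```

The baseline is Qwen3-VL-8B for every row, so the differences for the 4B and 7B models compare across model sizes rather than against a same-size baseline.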
## 🔗 Related Resources
| Resource | Link |
|----------|------|
| **Paper** | [arXiv:2602.11858](https://arxiv.org/pdf/2602.11858) |
| **Code** | [GitHub: Zooming-without-Zooming](https://github.com/inclusionAI/Zooming-without-Zooming) |
| **ZwZ-4B Model** | [inclusionAI/ZwZ-4B](https://huggingface.co/inclusionAI/ZwZ-4B) |
| **ZwZ-7B Model** | [inclusionAI/ZwZ-7B](https://huggingface.co/inclusionAI/ZwZ-7B) |
| **ZwZ-8B Model** | [inclusionAI/ZwZ-8B](https://huggingface.co/inclusionAI/ZwZ-8B) |
| **ZoomBench** | [inclusionAI/ZoomBench](https://huggingface.co/datasets/inclusionAI/ZoomBench) |
## 📄 Citation
```bibtex
@article{wei2026zooming,
title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
journal={arXiv preprint arXiv:2602.11858},
year={2026}
}
```
## 📝 License
Apache-2.0 License
Provided by:
Tennet7



