Tennet7/ZwZ-RL-VQA

Name: Tennet7/ZwZ-RL-VQA
Creator: Tennet7
Published: 2026-04-28 07:33:43
License: no description available

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/Tennet7/ZwZ-RL-VQA

Description:

---
language:
- en
license: apache-2.0
tags:
- multimodal
- vision-language-model
- fine-grained-perception
- vqa
- region-to-image-distillation
datasets:
- sa-1b
- laion
- visual-genome
- cc12m
size_categories:
- 10K<n<100K
---

# ZwZ-RL-VQA: Region-to-Image Distilled Training Data for Fine-Grained Perception

This dataset contains **74K high-quality VQA pairs** generated via **Region-to-Image Distillation (R2I)** for training multimodal large language models (MLLMs) on fine-grained perception tasks without test-time tool use.

## 📖 Overview

The **Zooming without Zooming (ZwZ)** method transforms "zooming" from an inference-time tool into a training-time primitive:

1. **Zoom-in Synthesis**: Strong teacher models (Qwen3-VL-235B, GLM-4.5V) generate questions and answers on micro-cropped regions where fine details are unambiguous
2. **Zoom-out Distillation**: Region-grounded supervision is distilled back to full images with explicit bounding-box overlays
3. **Single-Pass Inference**: Trained models internalize zooming benefits, achieving fine-grained perception in one forward pass

## 📊 Dataset Statistics

| Attribute | Value |
|-----------|-------|
| **Total Samples** | 74,000 |
| **Source Images** | SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, STPLS3D |
| **Image Resolution** | Mostly > 1000×1000 (high-resolution) |
| **Crop Ratio** | Mostly < 10% of full image area (fine-grained focus) |
| **Question Types** | Counting, OCR, Color, Structure, Material, Identification |
| **Consensus Filter** | >6/8 agreement among teacher ensembles |

## 🏗️ Data Generation Pipeline

### Teachers Used

| Role | Model |
|------|-------|
| **Question Generator** | Qwen3-VL-235B-A22B-Instruct |
| **Answer Generator 1** | Qwen3-VL-235B-A22B-Instruct |
| **Answer Generator 2** | GLM-4.5V |

### Quality Control

- ✅ **Consensus Filtering**: Only retain QA pairs with >75% teacher agreement (6/8 votes)
- ✅ **Difficulty Filtering**: Reject samples that baseline Qwen3-VL-8B answers correctly >50% of the time
- ✅ **Visual Grounding**: Bounding boxes overlaid on images to resolve referential ambiguity

## 📂 Data Structure & Extraction

The image data is split across multiple compressed files to make downloads more reliable.

### 1. Extract Training Images

After downloading all `images.tar.gz.*` parts, use the following command to merge and extract them:

```bash
cd images/
# Merge split files and extract to the current directory
# (GNU tar does not auto-detect compression on piped input; if it reports
#  "Archive is compressed", use `tar -xzvf - -C ./` instead)
cat images.tar.gz* | tar -xvf - -C ./
```

### 2. Original Data & Synthesis (Optional)

If you are interested in how the training data `images.tar.gz.*` was synthesized, you can refer to the [data synthesis script](https://github.com/inclusionAI/Zooming-without-Zooming/blob/main/data_synthesis/create_vqa.py).

The synthesis process uses the **original images**. To extract the source data, follow these steps:

```bash
cd original_images/
# Merge split files and extract to the current directory
# (GNU tar does not auto-detect compression on piped input; if it reports
#  "Archive is compressed", use `tar -xzvf - -C ./` instead)
cat original_images.tar.gz* | tar -xvf - -C ./
```

Once extracted, you can use the script mentioned above to reproduce the dataset from these original images.

## 🎯 Intended Use

This dataset is designed for:

- **Reinforcement Learning** on MLLMs (e.g., with DAPO/GRPO)
- **Research on distilling tool-use capabilities** into single-pass models

## 📈 Training Results

Models trained on this dataset (ZwZ-4B/7B/8B) achieve:

| Model | ZoomBench | HR-Bench-4K | HR-Bench-8K | VStar |
|-------|-----------|-------------|-------------|-------|
| ZwZ-4B | **55.74** | 81.75 | 79.50 | **92.67** |
| ZwZ-7B | 55.62 | 75.38 | 73.25 | 88.48 |
| ZwZ-8B | **58.11** | **84.38** | **82.00** | 91.10 |

*vs. Qwen3-VL-8B baseline: 37.87 / 78.88 / 74.63 / 86.39*

## 🔗 Related Resources

| Resource | Link |
|----------|------|
| **Paper** | [arXiv:2602.11858](https://arxiv.org/pdf/2602.11858) |
| **Code** | [GitHub: Zooming-without-Zooming](https://github.com/inclusionAI/Zooming-without-Zooming) |
| **ZwZ-4B Model** | [inclusionAI/ZwZ-4B](https://huggingface.co/inclusionAI/ZwZ-4B) |
| **ZwZ-7B Model** | [inclusionAI/ZwZ-7B](https://huggingface.co/inclusionAI/ZwZ-7B) |
| **ZwZ-8B Model** | [inclusionAI/ZwZ-8B](https://huggingface.co/inclusionAI/ZwZ-8B) |
| **ZoomBench** | [inclusionAI/ZoomBench](https://huggingface.co/datasets/inclusionAI/ZoomBench) |

## 📄 Citation

```bibtex
@article{wei2026zooming,
  title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
  author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
  journal={arXiv preprint arXiv:2602.11858},
  year={2026}
}
```

## 📝 License

Apache-2.0 License
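
The two filtering rules listed under Quality Control (consensus filtering and difficulty filtering) can be illustrated with a small sketch. This is not the authors' implementation; the linked `create_vqa.py` script is the authoritative source. The function names, the exact-match answer comparison, and the lowercase normalization are all hypothetical. The card gives the consensus threshold both as ">6/8" and as ">75% (6/8 votes)", so the sketch assumes that six matching votes out of eight are enough and exposes the threshold as a parameter.

```python
from collections import Counter


def consensus_answer(votes, min_votes=6):
    """Return the most common answer if at least `min_votes` teacher votes agree.

    Assumption: answers are compared by case-insensitive exact match.
    Assumption: 6 of 8 matching votes pass; raise `min_votes` for a stricter rule.
    """
    if not votes:
        return None
    normalized = [v.strip().lower() for v in votes]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count >= min_votes else None


def is_hard_enough(baseline_answers, reference, max_accuracy=0.5):
    """Keep a sample only if the baseline is correct at most `max_accuracy` of the time."""
    if not baseline_answers:
        return False
    correct = sum(a.strip().lower() == reference for a in baseline_answers)
    return correct / len(baseline_answers) <= max_accuracy


def keep_sample(teacher_votes, baseline_answers):
    """Apply both filters; return the agreed answer, or None if the sample is rejected."""
    answer = consensus_answer(teacher_votes)
    if answer is None:
        return None  # teachers disagree too much
    if not is_hard_enough(baseline_answers, answer):
        return None  # baseline already solves it more than half the time
    return answer
```

For example, `keep_sample(["red"] * 7 + ["blue"], ["red", "blue", "blue", "green"])` keeps the sample with answer `"red"`, because seven teachers agree and the hypothetical baseline is right in only one of four attempts.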

Provider:

Tennet7
