YuhengSSS/Q-Zoom-Training
Source: Hugging Face | Updated: 2026-04-08 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/YuhengSSS/Q-Zoom-Training
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- visual-question-answering
tags:
- q-zoom
- region-of-interest
- vision-language-model
- qwen2.5-vl
- qwen3-vl
size_categories:
- 100K<n<1M
---
# Q-Zoom Training Data
Curated training data for the **Q-Zoom** gated Region-of-Interest mechanism for Vision-Language Models. Companion dataset to the [Q-Zoom release repository](https://github.com/YuhengSSS).
## What this repo contains
This repo holds the **question JSONLs** and the **ROI training pickles** used to train the three components of Q-Zoom (SD-RPN, Post-SFT, Dynamic Gate). It is the companion to:
- **[YuhengSSS/RoITraining](https://huggingface.co/datasets/YuhengSSS/RoITraining)** — image archives (`*.tar` / `*.zip`) for COCO / GQA / OCR-VQA / DocVQA / TextVQA / ChartQA / InfographicsVQA. You need both repos for end-to-end training; the split exists because the image archives are redistributed under their upstream licenses while this repo holds model-derived pseudo labels and Q-Zoom-specific universal inputs.
## Mapping to the Q-Zoom paper
Each Q-Zoom component (SD-RPN / Post-SFT / Dynamic Gate) maps 1:1 to one training stage:
- **SD-RPN** ⇄ Stage 1 (the TWIG branch is initialized from pseudo ROI maps)
- **Post-SFT** ⇄ Stage 2 (the LLM is fine-tuned under Q-Zoom gating)
- **Dynamic Gate** ⇄ Stage 3 (the high-resolution gating network is refined)
| Component | Training source | Samples | File in this repo |
|---|---|---:|---|
| **SD-RPN** | GQA | 72K | `llava_v1_5_mix665k_selected_qa.jsonl` (filtered to `gqa` records) |
| | OCR-VQA | 80K | `llava_v1_5_mix665k_selected_qa.jsonl` (filtered to `ocr_vqa` records) |
| | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` |
| | *Total* | *185K* | |
| **Post-SFT** | TextVQA<sub>train</sub> | 34K | `textvqa/converted_llava_style_train.jsonl` |
| | ChartQA<sub>train</sub> | 28K | `chartqa_28k_qa.jsonl` |
| | VCoT-InfoVQA | 15K | `visual_cot_infovqa_subset15k.jsonl` |
| | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` |
| | V\*-COCO | 44K | `vstar_coco_spatial_relation_data.jsonl` |
| **Dynamic Gate** | VCoT-TextVQA + VCoT-GQA (merged) | 18K + 50K = 68K | `visual_cot_llava_subset68k.jsonl` |
| | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` |
| | ChartQA<sub>train</sub> | 28K | `chartqa_28k_qa.jsonl` |
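The table says the SD-RPN sources are obtained by filtering `llava_v1_5_mix665k_selected_qa.jsonl` to `gqa` and `ocr_vqa` records, but it does not say how a record's source is identified. The sketch below assumes the source is the first path component of the `image` field (for example `gqa/images/...`), as in the upstream LLaVA-1.5 mixture. The helper name `iter_source_records` is hypothetical and not part of the Q-Zoom repository.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_source_records(jsonl_path: str, source: str) -> Iterator[dict]:
    """Yield records whose image path starts with `<source>/`.

    Assumption: the source dataset is encoded as the first path component
    of the "image" field (e.g. "gqa/images/1.jpg").
    """
    prefix = source.rstrip("/") + "/"
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without an "image" field never match a source prefix.
            if str(record.get("image", "")).startswith(prefix):
                yield record
```

For example, `sum(1 for _ in iter_source_records(path, "gqa"))` counts the GQA records, which can be compared against the sample count listed in the table.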
## Pre-built training files (skip data generation)
If you only want to **train** Q-Zoom, you do not need to regenerate the Stage-1 pseudo labels, the Stage-2 judged post-SFT mixture, or the Stage-3 ROI mixture from scratch. Point the training scripts directly at the pre-built file for your backbone below instead of running the `standardized_pipeline/{stage1,stage2,stage3}/` pipelines:
| Stage | Backbone | File |
|---|---|---|
| **Stage-1 pseudo labels (185K)** | Qwen2.5-VL-3B | `qwen2_5vl_pseudo_3b_576res_185k.pkl` |
| | Qwen2.5-VL-7B | `qwen2_5vl_pseudo_7b_576res_185k.pkl` |
| | Qwen3-VL-4B | `qwen3vl_pseudo_4b_576res_185k.pkl` |
| **Stage-2 Post-SFT JSONL** | Qwen2.5-VL-3B | `qwen2_5vl_3b_stage2.jsonl` |
| | Qwen2.5-VL-7B | `qwen2_5vl_7b_stage2.jsonl` |
| | Qwen3-VL-4B | `qwen3vl_4b_stage2.jsonl` |
| **Stage-3 Dynamic Gate ROI pkl** | Qwen2.5-VL-3B | `qwen2_5vl_3b_stage3.pkl` |
| | Qwen2.5-VL-7B | `qwen2_5vl_7b_stage3.pkl` |
| | Qwen3-VL-4B | `qwen3vl_4b_stage3.pkl` |
The Stage-2 JSONLs are the **judged post-SFT mixture** produced by `standardized_pipeline/stage2`. Each holds the per-backbone subset of (TextVQA, ChartQA, VCoT-DocVQA, VCoT-InfoVQA, V\*-COCO) on which the Stage-1 ROI model and the base VLM disagree, with the winning answer kept as the SFT target. They are backbone-specific because the base model and the Stage-1 ROI model differ between Qwen2.5-VL and Qwen3-VL, so do **not** mix them across backbones.
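One way to guard against accidentally mixing files from different backbones is to parse the backbone tag from each file name before launching training. This is a minimal sketch that assumes the naming pattern seen in the table above (`<family>_[pseudo_]<size>_...`). The regex and both helper names are illustrative and not part of the Q-Zoom code. The check is also stricter than the text requires, since it rejects mixing model sizes as well as model families.

```python
import re
from typing import Iterable, Tuple

# Assumed naming pattern, inferred from the file names listed in this card.
_BACKBONE_RE = re.compile(r"^(qwen2_5vl|qwen3vl)_(?:pseudo_)?(\d+b)_")


def backbone_of(filename: str) -> Tuple[str, str]:
    """Return (family, size), e.g. ("qwen2_5vl", "7b"), for a pre-built file."""
    basename = filename.rsplit("/", 1)[-1]
    match = _BACKBONE_RE.match(basename)
    if match is None:
        raise ValueError(f"not a per-backbone file name: {filename!r}")
    return match.group(1), match.group(2)


def assert_single_backbone(filenames: Iterable[str]) -> Tuple[str, str]:
    """Raise if the given pre-built files belong to more than one backbone."""
    tags = {backbone_of(name) for name in filenames}
    if len(tags) != 1:
        raise ValueError(f"files span several backbones: {sorted(tags)}")
    return tags.pop()
```

Calling `assert_single_backbone` on the Stage-1, Stage-2, and Stage-3 paths passed to a training run returns the shared backbone tag, or raises before any training starts.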
## Universal inputs
This repo also ships three pre-built universal-input JSONLs that the `standardized_pipeline/{stage1,stage2,stage3}/` scripts in the Q-Zoom repo would otherwise generate from the source question files:
- `stage1_universal_input.jsonl`
- `stage2_universal_input.jsonl`
- `stage3_universal_input.jsonl`
You can use these directly to skip the `build_universal_input.py` step at each stage.
## Download
```bash
pip install -U "huggingface_hub[cli]"
# Pull everything into ${DATA_ROOT}
huggingface-cli download YuhengSSS/Q-Zoom-Training \
--repo-type dataset \
--local-dir "${DATA_ROOT}" --local-dir-use-symlinks False
```
Or download only the files you need (e.g. for one backbone):
```bash
# Pass every pattern to a single --include (it is a list-valued flag; repeating it may keep only the last pattern)
huggingface-cli download YuhengSSS/Q-Zoom-Training \
--repo-type dataset \
--local-dir "${DATA_ROOT}" --local-dir-use-symlinks False \
--include "qwen2_5vl_pseudo_7b_576res_185k.pkl" "qwen2_5vl_7b_stage3.pkl" "*.jsonl"
```
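The same selective download can be sketched in Python with `huggingface_hub.snapshot_download` and its `allow_patterns` argument. The helpers `patterns_for_backbone` and `download_subset` are hypothetical names, and the file-name templates are inferred from the pre-built file table in this card rather than taken from the Q-Zoom scripts.

```python
from typing import List


def patterns_for_backbone(family: str, size: str) -> List[str]:
    """Glob patterns for one backbone, e.g. family="qwen2_5vl", size="7b".

    Assumption: file names follow the templates in the pre-built file table.
    """
    return [
        f"{family}_pseudo_{size}_576res_185k.pkl",  # Stage-1 pseudo labels
        f"{family}_{size}_stage3.pkl",  # Stage-3 Dynamic Gate ROI pickle
        "*.jsonl",  # question files, universal inputs, Stage-2 JSONLs
    ]


def download_subset(data_root: str, family: str, size: str) -> str:
    """Download only the files for one backbone into `data_root`."""
    # Imported lazily so patterns_for_backbone works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="YuhengSSS/Q-Zoom-Training",
        repo_type="dataset",
        local_dir=data_root,
        allow_patterns=patterns_for_backbone(family, size),
    )
```

For instance, `download_subset("/path/to/data", "qwen2_5vl", "7b")` would fetch the same files as the CLI example, assuming network access and an installed `huggingface_hub`.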
## File-format notes
- All `.jsonl` files contain one JSON object per line. Stage-1 and the per-source pools (`llava_v1_5_*`, `chartqa_28k_*`, `vstar_coco_*`, `visual_cot_*`, `textvqa/converted_llava_style_train.jsonl`) follow the LLaVA-style `{"id", "image", "conversations"}` schema.
- The three `stage{1,2,3}_universal_input.jsonl` files use the Q-Zoom-specific universal-input schema produced by `standardized_pipeline/<stage>/build_universal_input.py` (one record per `(uid, dataset, image, text, mode)` tuple).
- The `*_pseudo_*.pkl` files store a dict of lists keyed by `question_id`, holding sigmoid-activated ROI attention maps and the original prompts. They are loaded directly by `qwen-vl-finetune/qwenvl/data/data_qwen.py` when `--roi_data_path` is set.
- The `qwen*_stage2.jsonl` files use the same LLaVA-style `{"id", "image", "conversations"}` schema and are consumed by `qwen-vl-finetune` as the post-SFT data when `--roi_post_training True` is set.
- The `*_stage3.pkl` files have the same on-disk schema as the pseudo files but with the Stage-3 dataset mixture (Dynamic Gate training set).
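A quick sanity check after downloading is to confirm that a JSONL file really carries the LLaVA-style keys and to look at the top-level shape of a pickle. The sketch below does not assume the inner layout of the ROI pickles beyond what is stated above. The function names are hypothetical. Only unpickle files from a source you trust, since `pickle` can execute arbitrary code on load.

```python
import json
import pickle
from collections import Counter

LLAVA_KEYS = {"id", "image", "conversations"}


def missing_keys_report(jsonl_path: str) -> Counter:
    """Count, per key, how many records lack a LLaVA-style field."""
    missing: Counter = Counter()
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            for key in LLAVA_KEYS - set(record):
                missing[key] += 1
    return missing


def summarize_pickle(pkl_path: str) -> dict:
    """Report the top-level type and size of a pickle, plus a few keys."""
    with open(pkl_path, "rb") as fh:
        obj = pickle.load(fh)  # trusted files only
    summary = {
        "type": type(obj).__name__,
        "len": len(obj) if hasattr(obj, "__len__") else None,
    }
    if isinstance(obj, dict):
        summary["sample_keys"] = list(obj)[:3]
    return summary
```

An empty `Counter` from `missing_keys_report` means every record has all three keys. A non-empty result is not necessarily an error, since the universal-input files use a different schema.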
## License
This repo redistributes:
- **Question subsets** (`llava_v1_5_mix665k_selected_qa.jsonl`, `visual_cot_*`, etc.) under the same terms as their upstream sources (LLaVA-1.5, Visual-CoT). Consult the upstream licenses before any commercial use.
- **Q-Zoom-derived files** (`stage{1,2,3}_universal_input.jsonl`, `*_pseudo_*.pkl`, `*_stage3.pkl`) under Apache 2.0, matching the Q-Zoom repository.
## Citation
If you use this data, please cite the Q-Zoom paper:
```bibtex
@article{qzoom,
title = {Q-Zoom: Gated Region-of-Interest for Vision-Language Models},
author = {<author list>},
year = {2026}
}
```
Provider:
YuhengSSS



