YuhengSSS/Q-Zoom-Training

Name: YuhengSSS/Q-Zoom-Training
Creator: YuhengSSS
Published: 2026-04-08 01:28:58
License: 暂无描述

Hugging Face2026-04-08 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/YuhengSSS/Q-Zoom-Training

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 language: - en task_categories: - visual-question-answering tags: - q-zoom - region-of-interest - vision-language-model - qwen2.5-vl - qwen3-vl size_categories: - 100K<n<1M --- # Q-Zoom Training Data Curated training data for the **Q-Zoom** gated Region-of-Interest mechanism for Vision-Language Models. Companion dataset to the [Q-Zoom release repository](https://github.com/YuhengSSS). ## What this repo contains This repo holds the **question JSONLs** and the **ROI training pickles** used to train the three components of Q-Zoom (SD-RPN, Post-SFT, Dynamic Gate). It is the companion to: - **[YuhengSSS/RoITraining](https://huggingface.co/datasets/YuhengSSS/RoITraining)** — image archives (`*.tar` / `*.zip`) for COCO / GQA / OCR-VQA / DocVQA / TextVQA / ChartQA / InfographicsVQA. You need both repos for end-to-end training; the split exists because the image archives are redistributed under their upstream licenses while this repo holds model-derived pseudo labels and Q-Zoom-specific universal inputs. ## Mapping to the Q-Zoom paper Each Q-Zoom component (SD-RPN / Post-SFT / Dynamic Gate) maps 1:1 to one training stage: - **SD-RPN** ⇄ Stage 1 (the TWIG branch is initialized from pseudo ROI maps) - **Post-SFT** ⇄ Stage 2 (the LLM is fine-tuned under Q-Zoom gating) - **Dynamic Gate** ⇄ Stage 3 (the high-resolution gating network is refined) | Component | Training source | Samples | File in this repo | |---|---|---:|---| | **SD-RPN** | GQA | 72K | `llava_v1_5_mix665k_selected_qa.jsonl` (filtered to `gqa` records) | | | OCR-VQA | 80K | `llava_v1_5_mix665k_selected_qa.jsonl` (filtered to `ocr_vqa` records) | | | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` | | | *Total* | *185K* | | | **Post-SFT** | TextVQAtrain | 34K | `textvqa/converted_llava_style_train.jsonl` | | | ChartQAtrain | 28K | `chartqa_28k_qa.jsonl` | | | VCoT-InfoVQA | 15K | `visual_cot_infovqa_subset15k.jsonl` | | | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` | | | V\*-COCO | 44K | `vstar_coco_spatial_relation_data.jsonl` | | **Dynamic Gate** | VCoT-TextVQA + VCoT-GQA (merged) | 18K + 50K = 68K | `visual_cot_llava_subset68k.jsonl` | | | VCoT-DocVQA | 33K | `visual_cot_docvqa_subset33k.jsonl` | | | ChartQAtrain | 28K | `chartqa_28k_qa.jsonl` | ## Pre-built training files (skip data generation) If you only want to **train** Q-Zoom and not regenerate Stage-1 pseudo labels, the Stage-2 judged post-SFT mixture, or the Stage-3 ROI mixture from scratch, point the training scripts straight at one of the pre-built per-backbone files below instead of running the `standardized_pipeline/{stage1,stage2,stage3}/` pipelines: | Stage | Backbone | File | |---|---|---| | **Stage-1 pseudo labels (185K)** | Qwen2.5-VL-3B | `qwen2_5vl_pseudo_3b_576res_185k.pkl` | | | Qwen2.5-VL-7B | `qwen2_5vl_pseudo_7b_576res_185k.pkl` | | | Qwen3-VL-4B | `qwen3vl_pseudo_4b_576res_185k.pkl` | | **Stage-2 Post-SFT JSONL** | Qwen2.5-VL-3B | `qwen2_5vl_3b_stage2.jsonl` | | | Qwen2.5-VL-7B | `qwen2_5vl_7b_stage2.jsonl` | | | Qwen3-VL-4B | `qwen3vl_4b_stage2.jsonl` | | **Stage-3 Dynamic Gate ROI pkl** | Qwen2.5-VL-3B | `qwen2_5vl_3b_stage3.pkl` | | | Qwen2.5-VL-7B | `qwen2_5vl_7b_stage3.pkl` | | | Qwen3-VL-4B | `qwen3vl_4b_stage3.pkl` | The Stage-2 JSONLs are the **judged post-SFT mixture** produced by `standardized_pipeline/stage2` — they hold the per-backbone subset of (TextVQA, ChartQA, VCoT-DocVQA, VCoT-InfoVQA, V*-COCO) on which the Stage-1 ROI model and the base VLM disagree, with the winning answer kept as the SFT target. They are backbone-specific because the base model and the Stage-1 ROI model differ between Qwen2.5-VL and Qwen3-VL — do **not** mix them across backbones. ## Universal inputs Three pre-built universal-input JSONLs that the `standardized_pipeline/{stage1,stage2,stage3}/` scripts in the Q-Zoom repo would otherwise generate from the source question files: - `stage1_universal_input.jsonl` - `stage2_universal_input.jsonl` - `stage3_universal_input.jsonl` You can use these directly to skip the `build_universal_input.py` step at each stage. ## Download ```bash pip install -U "huggingface_hub[cli]" # Pull everything into ${DATA_ROOT} huggingface-cli download YuhengSSS/Q-Zoom-Training \ --repo-type dataset \ --local-dir "${DATA_ROOT}" --local-dir-use-symlinks False ``` Or download only the files you need (e.g. for one backbone): ```bash huggingface-cli download YuhengSSS/Q-Zoom-Training \ --repo-type dataset \ --local-dir "${DATA_ROOT}" --local-dir-use-symlinks False \ --include "qwen2_5vl_pseudo_7b_576res_185k.pkl" \ --include "qwen2_5vl_7b_stage3.pkl" \ --include "*.jsonl" ``` ## File-format notes - All `.jsonl` files contain one JSON object per line. Stage-1 and the per-source pools (`llava_v1_5_*`, `chartqa_28k_*`, `vstar_coco_*`, `visual_cot_*`, `textvqa/converted_llava_style_train.jsonl`) follow the LLaVA-style `{"id", "image", "conversations"}` schema. - The three `stage{1,2,3}_universal_input.jsonl` files use the Q-Zoom-specific universal-input schema produced by `standardized_pipeline/<stage>/build_universal_input.py` (one record per `(uid, dataset, image, text, mode)` tuple). - The `*_pseudo_*.pkl` files store dict-of-lists keyed by `question_id`, with sigmoid-activated ROI attention maps and the original prompts. They are loaded directly by `qwen-vl-finetune/qwenvl/data/data_qwen.py` when `--roi_data_path` is set. - The `qwen*_stage2.jsonl` files use the same LLaVA-style `{"id", "image", "conversations"}` schema and are consumed by `qwen-vl-finetune` as the post-SFT data when `--roi_post_training True` is set. - The `*_stage3.pkl` files have the same on-disk schema as the pseudo files but with the Stage-3 dataset mixture (Dynamic Gate training set). ## License This repo redistributes: - **Question subsets** (`llava_v1_5_mix665k_selected_qa.jsonl`, `visual_cot_*`, etc.) under the same terms as their upstream sources (LLaVA-1.5, Visual-CoT). Please consult those for any commercial use. - **Q-Zoom-derived files** (`stage{1,2,3}_universal_input.jsonl`, `*_pseudo_*.pkl`, `*_stage3.pkl`) under Apache 2.0, matching the Q-Zoom repository. ## Citation If you use this data, please cite the Q-Zoom paper: ```bibtex @article{qzoom, title = {Q-Zoom: Gated Region-of-Interest for Vision-Language Models}, author = {<author list>}, year = {2026} } ```

提供机构：

YuhengSSS

5,000+

优质数据集

54 个

任务类型

进入经典数据集