CSU-JPG/VPBench

Name: CSU-JPG/VPBench
Creator: CSU-JPG
Published: 2026-04-09 02:18:45
License: apache-2.0

Hugging Face · Updated 2026-04-09 · Indexed 2026-05-10

Download link:

https://hf-mirror.com/datasets/CSU-JPG/VPBench

Description:

---
license: apache-2.0
dataset_info:
- config_name: class2image
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 40402985
    num_examples: 100
  download_size: 40414187
  dataset_size: 40402985
- config_name: doodles
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 40127584
    num_examples: 50
  download_size: 40130903
  dataset_size: 40127584
- config_name: force
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 82976644
    num_examples: 150
  download_size: 82966446
  dataset_size: 82976644
- config_name: text2image
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 22163018
    num_examples: 50
  download_size: 22166087
  dataset_size: 22163018
- config_name: text_box_control
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 37039690
    num_examples: 50
  download_size: 37045318
  dataset_size: 37039690
- config_name: text_in_image
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 214739988
    num_examples: 290
  download_size: 214733929
  dataset_size: 214739988
- config_name: trajectory
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 8089366
    num_examples: 50
  download_size: 8090278
  dataset_size: 8089366
- config_name: vismarker
  features:
  - name: pair_id
    dtype: string
  - name: subset
    dtype: string
  - name: category
    dtype: string
  - name: image_name
    dtype: string
  - name: input_relpath
    dtype: string
  - name: output_relpath
    dtype: string
  - name: recognized_text
    dtype: string
  - name: input_image
    dtype: image
  - name: output_image
    dtype: image
  splits:
  - name: train
    num_bytes: 241608849
    num_examples: 320
  download_size: 241592510
  dataset_size: 241608849
configs:
- config_name: class2image
  data_files:
  - split: train
    path: class2image/train-*
- config_name: doodles
  data_files:
  - split: train
    path: doodles/train-*
- config_name: force
  data_files:
  - split: train
    path: force/train-*
- config_name: text2image
  data_files:
  - split: train
    path: text2image/train-*
- config_name: text_box_control
  data_files:
  - split: train
    path: text_box_control/train-*
- config_name: text_in_image
  data_files:
  - split: train
    path: text_in_image/train-*
- config_name: trajectory
  data_files:
  - split: train
    path: trajectory/train-*
- config_name: vismarker
  data_files:
  - split: train
    path: vismarker/train-*
task_categories:
- image-to-image
- text-to-image
language:
- en
size_categories:
- 1K<n<10K
---

<div align="center">
<h2 align="center" style="margin-top: 0; margin-bottom: 15px;">
FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching
</h2>

TL;DR: The first vision-centric image-in, image-out image generation model.

<a href="https://csu-jpg.github.io/FlowInOne.github.io/" style="text-decoration: none;">🌐 Homepage</a> |
<a href="https://github.com/CSU-JPG/FlowInOne" style="text-decoration: none;">💻 Code</a> |
<a href="https://arxiv.org/pdf/2604.06757" style="text-decoration: none;">📄 Paper</a> |
<a href="https://huggingface.co/datasets/CSU-JPG/VisPrompt5M" style="text-decoration: none;">📁 Dataset</a> |
<a href="https://huggingface.co/datasets/CSU-JPG/VPBench" style="text-decoration: none;">🌏 Benchmark</a> |
<a href="https://huggingface.co/CSU-JPG/FlowInOne" style="text-decoration: none;">🤗 Model</a>
</div>

# VP-Bench

**VP-Bench** is the official evaluation benchmark for [**FlowInOne**](https://csu-jpg.github.io/FlowInOne.github.io/). It is a rigorously curated benchmark assessing **instruction faithfulness**, **spatial precision**, **visual realism**, and **content consistency** across eight distinct visual prompting tasks.

## Evaluation

Our evaluation scripts are now available on [GitHub](https://github.com/CSU-JPG/FlowInOne)!

## Dataset Subsets

The dataset contains **8 subsets**, each corresponding to a distinct visual instruction task:

| Subset | Abbrev. | Description |
|--------|---------|-------------|
| `class2image` | C2I | Class label rendered in input image → generate corresponding image |
| `text2image` | T2I | Text instruction rendered in input image → generate image |
| `text_in_image` | TIE | Edit text content within an image |
| `force` | FU | Physics-aware force understanding (3 categories) |
| `text_box_control` | TBE | Text and bounding box editing |
| `trajectory` | TU | Trajectory understanding and prediction |
| `vismarker` | VME | Visual marker guided editing (8 categories) |
| `doodles` | DE | Doodle-based editing |

## Dataset Features

- **input_image** (`image`): The input visual prompt image (with rendered instruction).
- **output_image** (`image`): The ground-truth output image.
- **recognized_text** (`string`): The text instruction rendered in the input image (extracted via OCR annotation).
- **subset** (`string`): The subset name.
- **category** (`string`): Sub-category within a subset (empty string if not applicable).
- **image_name** (`string`): The image filename.
- **input_relpath** (`string`): Relative path of the input image within the subset.
- **output_relpath** (`string`): Relative path of the output image within the subset.
- **pair_id** (`string`): Stable SHA1 identifier for each input-output pair.

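
As a rough illustration of the schema above, the sketch below checks that a record carries the nine listed fields and tallies the per-subset `num_examples` values declared in the card metadata. The helper name `schema_problems` and the placeholder record are hypothetical and exist only for this example. Image fields are checked for presence only, since their in-memory type depends on how the dataset is loaded.

```python
# Field names as listed in the Dataset Features section.
STRING_FIELDS = (
    "pair_id", "subset", "category", "image_name",
    "input_relpath", "output_relpath", "recognized_text",
)
IMAGE_FIELDS = ("input_image", "output_image")

# Example counts per subset, copied from num_examples in the card metadata.
SUBSET_SIZES = {
    "class2image": 100,
    "doodles": 50,
    "force": 150,
    "text2image": 50,
    "text_box_control": 50,
    "text_in_image": 290,
    "trajectory": 50,
    "vismarker": 320,
}


def schema_problems(row):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = []
    for name in STRING_FIELDS + IMAGE_FIELDS:
        if name not in row:
            problems.append(f"missing field: {name}")
    for name in STRING_FIELDS:
        if name in row and not isinstance(row[name], str):
            problems.append(f"field is not a string: {name}")
    return problems


# Placeholder record with made-up values, used only to exercise the check.
example_row = {name: "" for name in STRING_FIELDS}
example_row.update({name: object() for name in IMAGE_FIELDS})

print(schema_problems(example_row))
print(sum(SUBSET_SIZES.values()))
```

The summed count is consistent with the `1K<n<10K` size category declared in the metadata.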
## Loading the Dataset

```python
from datasets import load_dataset, concatenate_datasets

# class2image
ds = load_dataset("CSU-JPG/VPBench", "class2image", split="train")

# text2image
ds = load_dataset("CSU-JPG/VPBench", "text2image", split="train")

# text_in_image
ds = load_dataset("CSU-JPG/VPBench", "text_in_image", split="train")

# force
ds = load_dataset("CSU-JPG/VPBench", "force", split="train")

# text_box_control
ds = load_dataset("CSU-JPG/VPBench", "text_box_control", split="train")

# trajectory
ds = load_dataset("CSU-JPG/VPBench", "trajectory", split="train")

# vismarker
ds = load_dataset("CSU-JPG/VPBench", "vismarker", split="train")

# doodles
ds = load_dataset("CSU-JPG/VPBench", "doodles", split="train")

# Load all subsets and concatenate them into a single dataset
subsets = ["class2image", "text2image", "text_in_image", "force",
           "text_box_control", "trajectory", "vismarker", "doodles"]
ds_all = concatenate_datasets([
    load_dataset("CSU-JPG/VPBench", name=s, split="train")
    for s in subsets
])
```

## Evaluation Results

We evaluate multiple methods on VP-Bench using three state-of-the-art VLM evaluators (Gemini3, GPT-5.2, Qwen3.5) and human judges. The metric is success ratio (higher is better). Total denotes the average success rate across all eight task categories.

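
To make the definition of Total concrete, the sketch below takes the unweighted mean of the eight per-task scores for two rows of the Evaluator: Gemini3 table and compares it with the Total reported there. It assumes Total is a plain macro-average. That reading matches these two rows to within rounding, and the other rows are not checked here. The function name `macro_average` is illustrative.

```python
from statistics import mean

TASKS = ["C2I", "T2I", "TIE", "FU", "TBE", "TU", "VME", "DE"]

# (per-task success ratios, reported Total), copied from the Gemini3 table.
GEMINI3_ROWS = {
    "Nano Banana": ([0.650, 0.980, 0.423, 0.520, 0.614, 0.020, 0.548, 0.721], 0.560),
    "FlowInOne": ([0.890, 0.700, 0.355, 0.727, 0.302, 0.520, 0.292, 0.535], 0.540),
}


def macro_average(scores):
    """Unweighted mean over the eight task categories (assumed definition)."""
    if len(scores) != len(TASKS):
        raise ValueError(f"expected {len(TASKS)} scores, got {len(scores)}")
    return mean(scores)


for method, (scores, reported_total) in GEMINI3_ROWS.items():
    computed = macro_average(scores)
    print(f"{method}: computed {computed:.4f}, reported {reported_total:.3f}")
```

Because the mean is unweighted, a 50-example subset such as `trajectory` counts as much toward Total as the 320-example `vismarker` subset.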
Abbreviations: C2I: class-to-image · T2I: text-to-image · TIE: text-in-image edit · FU: force understanding · TBE: text & bbox edit · TU: trajectory understanding · VME: visual marker edit · DE: doodles edit

**Evaluator: Gemini3**

| Method | C2I | T2I | TIE | FU | TBE | TU | VME | DE | **Total** |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|
| Nano Banana (Google, 2025) | .650 | .980 | .423 | .520 | .614 | .020 | .548 | .721 | .560 |
| Omnigen2 (Wu et al., 2025) | .020 | .020 | .017 | .020 | .000 | .000 | .000 | .000 | .007 |
| Kontext (Labs et al., 2025) | .050 | .020 | .048 | .007 | .000 | .020 | .010 | .000 | .019 |
| Qwen-IE-2509 (Wu et al., 2025) | .230 | .040 | .069 | .000 | .000 | .020 | .023 | .000 | .048 |
| **FlowInOne (Ours)** | **.890** | **.700** | **.355** | **.727** | **.302** | **.520** | **.292** | **.535** | **.540** |

**Evaluator: GPT-5.2**

| Method | C2I | T2I | TIE | FU | TBE | TU | VME | DE | **Total** |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|
| Nano Banana (Google, 2025) | .680 | .959 | .152 | .127 | .023 | .040 | .136 | .302 | .302 |
| Omnigen2 (Wu et al., 2025) | .110 | .020 | .000 | .000 | .000 | .000 | .000 | .023 | .019 |
| Kontext (Labs et al., 2025) | .090 | .020 | .028 | .020 | .000 | .080 | .003 | .093 | .042 |
| Qwen-IE-2509 (Wu et al., 2025) | .240 | .120 | .080 | .020 | .022 | .060 | .020 | .047 | .076 |
| **FlowInOne (Ours)** | **.850** | **.800** | .079 | **.500** | **.116** | **.240** | .083 | **.465** | **.392** |

**Evaluator: Qwen3.5**

| Method | C2I | T2I | TIE | FU | TBE | TU | VME | DE | **Total** |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|
| Nano Banana (Google, 2025) | .600 | .959 | .386 | .367 | .257 | .040 | .321 | .744 | .469 |
| Omnigen2 (Wu et al., 2025) | .030 | .020 | .017 | .034 | .000 | .000 | .003 | .047 | .019 |
| Kontext (Labs et al., 2025) | .050 | .020 | .042 | .133 | .000 | .060 | .047 | .093 | .056 |
| Qwen-IE-2509 (Wu et al., 2025) | .270 | .060 | .080 | .087 | .047 | .040 | .033 | .047 | .083 |
| **FlowInOne (Ours)** | **.859** | **.720** | **.354** | **.713** | **.272** | **.320** | **.306** | **.481** | **.503** |

**Evaluator: Human**

| Method | C2I | T2I | TIE | FU | TBE | TU | VME | DE | **Total** |
|--------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---------:|
| Nano Banana (Google, 2025) | .602 | .904 | .271 | .250 | .200 | .050 | .229 | .742 | .406 |
| Omnigen2 (Wu et al., 2025) | .000 | .000 | .000 | .000 | .000 | .000 | .000 | .000 | .000 |
| Kontext (Labs et al., 2025) | .000 | .000 | .043 | .000 | .000 | .000 | .000 | .100 | .018 |
| Qwen-IE-2509 (Wu et al., 2025) | .067 | .000 | .029 | .000 | .000 | .000 | .000 | .000 | .012 |
| **FlowInOne (Ours)** | **.800** | **.645** | **.242** | **.705** | **.255** | **.280** | **.255** | **.400** | **.449** |

## Citation

If you find our work useful, please consider citing:

```
@article{yi2026flowinoneunifyingmultimodalgenerationimagein,
  title={FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching},
  author={Junchao Yi and Rui Zhao and Jiahao Tang and Weixian Lei and Linjie Li and Qisheng Su and Zhengyuan Yang and Lijuan Wang and Xiaofeng Zhu and Alex Jinpeng Wang},
  journal={arXiv preprint arXiv:2604.06757},
  year={2026}
}
```

Provider:

CSU-JPG
