tiiuae/visres_bench
Source: Hugging Face. Updated 2026-03-10; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/tiiuae/visres_bench
Description:
---
license: apache-2.0
task_categories:
- visual-question-answering
- image-to-text
language:
- en
tags:
- benchmark
- vision
- reasoning
- multimodal
- evaluation
pretty_name: VisRes-Bench
dataset_info:
- config_name: level_1_global_occlusion_50
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_global_occlusion_70
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_global_occlusion_80
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_edges
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_location_random_sampling
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_brightness
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_blur
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_rotation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_rotation_random_sampling
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_edges_random_sampling
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_1_location
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 1000
- config_name: level_2_uniform_count
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_count_progression
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_uniform_orientation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 458
- config_name: level_2_count_2_same_1_diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_orientation_2same_1diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 498
- config_name: level_2_uniform_color
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_count_arithmetic
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_count_minmax
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_orientation_3_diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_color_2same_1diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_color_3_diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_2_count_3_diff
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_3_spiral_color_orientation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 350
- config_name: level_3_spiral_color_orientation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 464
- config_name: level_3_coupled_color_count
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 500
- config_name: level_3_independent_color_object_orientation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 355
- config_name: level_3_coupled_color_orientation
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 374
- config_name: level_3_Independent_count_object_color
features:
- name: id
dtype: string
- name: task
dtype: string
- name: level
dtype: string
- name: guided_question
dtype: string
- name: generic_question
dtype: string
- name: images
sequence: image
- name: question
dtype: string
- name: answer
dtype: string
splits:
- name: test
num_examples: 479
configs:
- config_name: level_1_global_occlusion_50
data_files:
- split: test
path: level_1_global_occlusion_50percent/test-*
- config_name: level_1_global_occlusion_70
data_files:
- split: test
path: level_1_global_occlusion_70percent/test-*
- config_name: level_1_global_occlusion_80
data_files:
- split: test
path: level_1_global_occlusion_80percent/test-*
- config_name: level_1_edges
data_files:
- split: test
path: level_1_edges_eval_6k_location_only_dino_mode_options/test-*
- config_name: level_1_location_random_sampling
data_files:
- split: test
path: level_1_eval_6k_location_only_random_sampling/test-*
- config_name: level_1_brightness
data_files:
- split: test
path: level_1_eval_6k_brightness_dino_options/test-*
- config_name: level_1_blur
data_files:
- split: test
path: level_1_eval_6k_blur_dino_options/test-*
- config_name: level_1_rotation
data_files:
- split: test
path: level_1_eval_6k_rotation_direct_dino_options/test-*
- config_name: level_1_rotation_random_sampling
data_files:
- split: test
path: level_1_eval_6k_single_rotation_same_options/test-*
- config_name: level_1_edges_random_sampling
data_files:
- split: test
path: level_1_edges_eval_6k_location_only_random_sampling/test-*
- config_name: level_1_location
data_files:
- split: test
path: level_1_eval_6k_location_only_dino_mode_options/test-*
- config_name: level_2_uniform_count
data_files:
- split: test
path: level_2_count_only/test-*
- config_name: level_2_count_progression
data_files:
- split: test
path: level_2_count_progression_mixed/test-*
- config_name: level_2_uniform_orientation
data_files:
- split: test
path: level_2_orientation_only/test-*
- config_name: level_2_count_2_same_1_diff
data_files:
- split: test
path: level_2_count_distribution_2same_1diff/test-*
- config_name: level_2_orientation_2same_1diff
data_files:
- split: test
path: level_2_orientation_distribution_2same_1diff/test-*
- config_name: level_2_uniform_color
data_files:
- split: test
path: level_2_color_only/test-*
- config_name: level_2_count_arithmetic
data_files:
- split: test
path: level_2_count_operations/test-*
- config_name: level_2_count_minmax
data_files:
- split: test
path: level_2_count_minmax/test-*
- config_name: level_2_orientation_3_diff
data_files:
- split: test
path: level_2_orientation_distribution/test-*
- config_name: level_2_color_2same_1diff
data_files:
- split: test
path: level_2_color_distribution_2same_1diff/test-*
- config_name: level_2_color_3_diff
data_files:
- split: test
path: level_2_color_distribution/test-*
- config_name: level_2_count_3_diff
data_files:
- split: test
path: level_2_count_distribution/test-*
- config_name: level_3_spiral_color_orientation
data_files:
- split: test
path: level_3_compositional_spiral_orientation/test-*
- config_name: level_3_spiral_color_orientation
data_files:
- split: test
path: level_3_compositional_spiral_object_color/test-*
- config_name: level_3_coupled_color_count
data_files:
- split: test
path: level_3_coupled_count_color/test-*
- config_name: level_3_independent_color_object_orientation
data_files:
- split: test
path: level_3_independent_color_object_orientation/test-*
- config_name: level_3_coupled_color_orientation
data_files:
- split: test
path: level_3_coupled_orientation_color/test-*
- config_name: level_3_Independent_count_object_color
data_files:
- split: test
path: level_3_independent_distribution_arithmetic_object/test-*
---
# VisRes Bench
[Project page](https://visres-bench.github.io/) | [Paper](https://arxiv.org/abs/2512.21194)
**VisRes Bench** is a benchmark for evaluating the **visual reasoning** capabilities of Vision-Language Models (VLMs) in naturalistic settings without contextual language supervision. It is introduced in the paper [*VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs*](https://arxiv.org/abs/2512.21194).
## Paper Summary
Vision-Language Models excel at captioning and VQA, but it is unclear how much they rely on **visual reasoning** versus **linguistic priors**. VisRes Bench addresses this with **image-only, four-choice** tasks built from **real-world images** (~19,000 samples), so that performance reflects visual reasoning rather than textual shortcuts.
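As a minimal sketch of how one configuration might be read, the snippet below uses the Hugging Face `datasets` library with the repository ID and field names listed in this card's metadata. The helper names `load_config` and `build_prompt` are illustrative, not part of the benchmark, and how the `question` field relates to `guided_question` and `generic_question` is not stated in the card, so the sketch only selects between the latter two.

```python
from typing import Any, Dict

REPO_ID = "tiiuae/visres_bench"  # repository named in this card


def load_config(config_name: str):
    """Load the test split of one configuration (requires network access)."""
    # Imported lazily so build_prompt works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="test")


def build_prompt(example: Dict[str, Any], guided: bool = True) -> str:
    """Return the guided or the generic question text of one example."""
    key = "guided_question" if guided else "generic_question"
    return example[key]
```

A typical call would be `ds = load_config("level_1_blur")` followed by `build_prompt(ds[0])`; each example also carries `id`, `task`, `level`, a list of `images`, and the reference `answer`.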
The benchmark is organized in **three levels** of increasing complexity:
- **Level 1 (Perceptual grounding):** Local patch completion (one masked tile plus 4 candidate patches) under perturbations (blur, brightness, rotation, edges, location) and global occlusion (the released configurations mask 50%, 70%, or 80% of the image). Tests robustness and amodal completion.
- **Level 2 (Single-attribute rule):** Raven-style 3×3 grids with one missing cell; one attribute (color, count, or orientation) follows a row-wise rule. Includes uniform, 3-different, 2-similar-1-different, count progression, arithmetic, and min-max subtasks (5,956 samples).
- **Level 3 (Multi-attribute composition):** Same 3×3 format, but multiple attributes (color, count, orientation, object identity) follow row-wise, grid-wise, or spiral rules (2,522 samples).
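The per-level sample counts can be cross-checked against the `num_examples` values in the card's metadata. The sketch below copies those values by hand and sums them per level. The names `COUNTS` and `totals_by_level` are illustrative, and the counts are stored as pairs because one Level-3 configuration name appears twice in the metadata.

```python
from typing import Dict, List, Tuple

# Level-1 configurations; the metadata lists 1000 test examples for each.
LEVEL_1 = [
    "level_1_global_occlusion_50", "level_1_global_occlusion_70",
    "level_1_global_occlusion_80", "level_1_edges",
    "level_1_location_random_sampling", "level_1_brightness",
    "level_1_blur", "level_1_rotation", "level_1_rotation_random_sampling",
    "level_1_edges_random_sampling", "level_1_location",
]

# (config name, num_examples) pairs copied from the metadata.
COUNTS: List[Tuple[str, int]] = [(name, 1000) for name in LEVEL_1] + [
    ("level_2_uniform_count", 500), ("level_2_count_progression", 500),
    ("level_2_uniform_orientation", 458), ("level_2_count_2_same_1_diff", 500),
    ("level_2_orientation_2same_1diff", 498), ("level_2_uniform_color", 500),
    ("level_2_count_arithmetic", 500), ("level_2_count_minmax", 500),
    ("level_2_orientation_3_diff", 500), ("level_2_color_2same_1diff", 500),
    ("level_2_color_3_diff", 500), ("level_2_count_3_diff", 500),
    ("level_3_spiral_color_orientation", 350),
    ("level_3_spiral_color_orientation", 464),  # name repeated in metadata
    ("level_3_coupled_color_count", 500),
    ("level_3_independent_color_object_orientation", 355),
    ("level_3_coupled_color_orientation", 374),
    ("level_3_Independent_count_object_color", 479),
]


def totals_by_level(counts: List[Tuple[str, int]]) -> Dict[str, int]:
    """Sum example counts per level, keyed by the 'level_N' name prefix."""
    totals: Dict[str, int] = {}
    for name, n in counts:
        level = "_".join(name.split("_")[:2])
        totals[level] = totals.get(level, 0) + n
    return totals


print(totals_by_level(COUNTS))
```

With these values the three levels sum to 11,000, 5,956, and 2,522 examples, or 19,478 in total.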
**Main findings:** State-of-the-art VLMs perform near **random (25%)** on many subtasks under subtle perceptual changes. Performance is stronger on color than count, and weakest on orientation. When the same logical structure is given as **text**, models do much better, indicating a **visual-to-symbolic** bottleneck rather than a pure reasoning limit. Higher resolution and guided/thinking prompts help but do not close the gap to human baselines.
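Because every item has four options, accuracy is read against a 25% chance level. A minimal scoring sketch follows. It assumes predictions and reference answers are short strings that can be compared after trimming whitespace and ignoring case, which is an assumption about the `answer` format rather than something the card specifies; `normalize` and `accuracy` are illustrative names.

```python
from typing import Iterable

CHANCE = 0.25  # four answer options per question


def normalize(answer: str) -> str:
    """Trim whitespace and ignore case (assumed answer format)."""
    return answer.strip().upper()


def accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of predictions that match their reference answer."""
    preds, refs = list(predictions), list(references)
    if len(preds) != len(refs):
        raise ValueError("predictions and references differ in length")
    if not refs:
        raise ValueError("no examples to score")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return correct / len(refs)


def margin_over_chance(acc: float, chance: float = CHANCE) -> float:
    """Accuracy minus the chance level, both as fractions."""
    return acc - chance
```

For example, `accuracy(["a", "B ", "c", "D"], ["A", "B", "D", "D"])` gives 0.75, a margin of 0.5 over chance.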
---
## Main Results (Guided Prompting, Thinking Mode When Available)
Accuracy (%) across levels and subtasks. Random chance = 25%.
<table>
<thead>
<tr>
<th>Setting</th>
<th>GPT-5</th>
<th>GPT-4o</th>
<th>Gemini-2.5</th>
<th>Qwen3-VL-4B</th>
<th>Qwen3-VL-30B</th>
<th>Mimo-VL-7B</th>
</tr>
</thead>
<tbody>
<tr><td colspan="7"><strong>Level-1</strong></td></tr>
<tr><td>Edges</td><td>27.17</td><td>23.91</td><td>25.00</td><td>16.67</td><td>25.00</td><td>22.30</td></tr>
<tr><td>Location</td><td>23.71</td><td>20.62</td><td>26.00</td><td>23.16</td><td>22.40</td><td>25.77</td></tr>
<tr><td>Rotation</td><td>35.42</td><td>26.04</td><td>34.38</td><td>37.50</td><td>36.05</td><td>29.17</td></tr>
<tr><td>Brightness</td><td>25.26</td><td>27.37</td><td>27.37</td><td>31.52</td><td>29.47</td><td>27.37</td></tr>
<tr><td>Blur</td><td>31.18</td><td>25.26</td><td>26.32</td><td>24.73</td><td>24.28</td><td>26.32</td></tr>
<tr><td>Global@50%</td><td>42.86</td><td>20.88</td><td>57.14</td><td>37.50</td><td>47.25</td><td>48.35</td></tr>
<tr><td>Global@80%</td><td>32.61</td><td>22.83</td><td>36.96</td><td>25.88</td><td>35.87</td><td>30.43</td></tr>
<tr><td><strong>Level-1 Average</strong></td><td><strong>31.10</strong></td><td><strong>23.86</strong></td><td><strong>33.28</strong></td><td><strong>28.17</strong></td><td><strong>31.20</strong></td><td><strong>29.22</strong></td></tr>
<tr><td colspan="7"><strong>Level-2</strong></td></tr>
<tr><td>Uniform Color</td><td>96.00</td><td>21.00</td><td>97.00</td><td>66.20</td><td>88.00</td><td>78.95</td></tr>
<tr><td>Uniform Count</td><td>61.00</td><td>25.00</td><td>90.91</td><td>40.82</td><td>59.00</td><td>52.75</td></tr>
<tr><td>Uniform Orientation</td><td>22.22</td><td>25.25</td><td>26.53</td><td>26.00</td><td>23.00</td><td>19.19</td></tr>
<tr><td>Count Progression</td><td>50.00</td><td>13.00</td><td>77.00</td><td>37.20</td><td>48.00</td><td>36.96</td></tr>
<tr><td>Count Arithmetic</td><td>52.00</td><td>22.00</td><td>75.76</td><td>43.20</td><td>49.00</td><td>33.33</td></tr>
<tr><td><strong>Level-2 Average</strong></td><td><strong>49.79</strong></td><td><strong>24.12</strong></td><td><strong>62.29</strong></td><td><strong>37.18</strong></td><td><strong>46.75</strong></td><td><strong>39.15</strong></td></tr>
<tr><td colspan="7"><strong>Level-3</strong></td></tr>
<tr><td>Independent Color-Object-Orientation</td><td>34.00</td><td>25.25</td><td>38.00</td><td>27.39</td><td>32.60</td><td>19.00</td></tr>
<tr><td>Independent Count-Object-Color</td><td>34.00</td><td>24.00</td><td>44.00</td><td>29.45</td><td>36.34</td><td>29.00</td></tr>
<tr><td>Coupled Color-Orientation</td><td>24.24</td><td>24.00</td><td>16.33</td><td>26.13</td><td>29.43</td><td>20.00</td></tr>
<tr><td>Coupled Color-Count</td><td>30.00</td><td>22.00</td><td>21.21</td><td>27.46</td><td>33.33</td><td>28.00</td></tr>
<tr><td>Spiral Color-Count-Object</td><td>56.00</td><td>30.00</td><td>54.17</td><td>28.63</td><td>36.00</td><td>33.00</td></tr>
<tr><td><strong>Level-3 Average</strong></td><td><strong>34.39</strong></td><td><strong>23.86</strong></td><td><strong>33.73</strong></td><td><strong>26.31</strong></td><td><strong>31.36</strong></td><td><strong>25.17</strong></td></tr>
</tbody>
</table>
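Many entries in the table sit within a few points of the 25% chance level. One rough way to judge whether such a score differs from chance is a normal-approximation z-score for a binomial proportion, sketched below. The sample size `n` is a hypothetical input: the card does not state how many items were evaluated per cell, so any value passed in is an assumption.

```python
import math


def z_vs_chance(accuracy_pct: float, n: int, chance: float = 0.25) -> float:
    """z-score of an accuracy (in percent) against the chance level.

    Uses the standard error of a proportion under the chance hypothesis,
    sqrt(chance * (1 - chance) / n). `n` is the number of scored items.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    standard_error = math.sqrt(chance * (1.0 - chance) / n)
    return (accuracy_pct / 100.0 - chance) / standard_error
```

Under this approximation, an absolute z-score below about 2 is usually not taken as evidence of a difference from chance.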
---
## Finetuning on Level-1 (Qwen2.5-VL-3B)
<table>
<thead>
<tr>
<th>Setting</th>
<th>Original</th>
<th>Finetuned</th>
<th>Human Baseline</th>
</tr>
</thead>
<tbody>
<tr><td>Location</td><td>24.3</td><td>42.8</td><td>94.1</td></tr>
<tr><td>Blur</td><td>23.9</td><td>37.5</td><td>84.3</td></tr>
<tr><td>Brightness</td><td>23.7</td><td>39.8</td><td>85.6</td></tr>
<tr><td>Rotation</td><td>25.5</td><td>50.8</td><td>92.0</td></tr>
<tr><td>Edges</td><td>25.1</td><td>33.2</td><td>82.6</td></tr>
<tr><td>Global (50%)</td><td>24.9</td><td>52.2</td><td>96.1</td></tr>
<tr><td>Global (80%)</td><td>23.9</td><td>38.6</td><td>98.0</td></tr>
<tr><td><strong>Average</strong></td><td><strong>24.5</strong></td><td><strong>43.7</strong></td><td><strong>90.4</strong></td></tr>
</tbody>
</table>
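The table can also be read as how much of the distance to the human baseline finetuning recovers. The sketch below computes, per setting, the absolute gain and the fraction of the original-to-human gap that was closed, using the figures from the table above; the name `gap_closed` and the metric itself are illustrative and not taken from the paper.

```python
from typing import Dict, Tuple

# (original, finetuned, human baseline) accuracy in percent, from the table.
FINETUNE: Dict[str, Tuple[float, float, float]] = {
    "Location": (24.3, 42.8, 94.1),
    "Blur": (23.9, 37.5, 84.3),
    "Brightness": (23.7, 39.8, 85.6),
    "Rotation": (25.5, 50.8, 92.0),
    "Edges": (25.1, 33.2, 82.6),
    "Global (50%)": (24.9, 52.2, 96.1),
    "Global (80%)": (23.9, 38.6, 98.0),
}


def gap_closed(original: float, finetuned: float, human: float) -> float:
    """Share of the original-to-human gap recovered by finetuning."""
    return (finetuned - original) / (human - original)


for setting, (orig, tuned, human) in FINETUNE.items():
    gain = round(tuned - orig, 1)
    share = gap_closed(orig, tuned, human)
    print(f"{setting:13s} gain={gain:5.1f}  gap closed={share:.1%}")
```

On these figures, finetuning closes well under half of the gap to the human baseline in every setting.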
---
## Single-Attribute Recognition (Perceptual Grounding)
Accuracy (%) when models are asked to report a single attribute (color, orientation, or count) for one grid cell.
<table>
<thead>
<tr>
<th>Attribute</th>
<th>GPT-4o</th>
<th>GPT-5</th>
</tr>
</thead>
<tbody>
<tr><td>Color</td><td>84.6</td><td>97.6</td></tr>
<tr><td>Orientation</td><td>39.8</td><td>49.6</td></tr>
<tr><td>Count</td><td>72.4</td><td>94.2</td></tr>
</tbody>
</table>
---
## Impact of Thinking Mode
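Accuracy (%) with thinking mode enabled (✓) vs disabled (✗). Open-source models improve substantially with thinking on Level-1 and Level-2; on Level-3 the gains are smaller, and Mimo-VL is essentially unchanged (25.17 vs 25.23).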
<table>
<thead>
<tr>
<th>Level</th>
<th>GPT-5 (high)</th>
<th>GPT-5 (low)</th>
<th>Mimo-VL ✓</th>
<th>Mimo-VL ✗</th>
<th>Qwen3-4B ✓</th>
<th>Qwen3-4B ✗</th>
<th>Qwen3-30B ✓</th>
<th>Qwen3-30B ✗</th>
</tr>
</thead>
<tbody>
<tr><td>Level-1</td><td>32.61</td><td>31.43</td><td>29.22</td><td>23.91</td><td>28.17</td><td>23.16</td><td>31.20</td><td>23.60</td></tr>
<tr><td>Level-2</td><td>49.79</td><td>47.01</td><td>39.15</td><td>26.68</td><td>37.18</td><td>24.08</td><td>46.75</td><td>28.25</td></tr>
<tr><td>Level-3</td><td>34.39</td><td>32.89</td><td>25.17</td><td>25.23</td><td>26.31</td><td>23.50</td><td>31.36</td><td>24.00</td></tr>
</tbody>
</table>
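The effect of thinking mode for the open-source models can be summarized as a per-level difference between the ✓ and ✗ columns. The sketch below copies those columns from the table above; `THINKING` and `thinking_deltas` are illustrative names.

```python
from typing import Dict, List, Tuple

# model -> (accuracy with thinking, accuracy without), Level-1 to Level-3.
THINKING: Dict[str, Tuple[List[float], List[float]]] = {
    "Mimo-VL": ([29.22, 39.15, 25.17], [23.91, 26.68, 25.23]),
    "Qwen3-4B": ([28.17, 37.18, 26.31], [23.16, 24.08, 23.50]),
    "Qwen3-30B": ([31.20, 46.75, 31.36], [23.60, 28.25, 24.00]),
}


def thinking_deltas(
    table: Dict[str, Tuple[List[float], List[float]]]
) -> Dict[str, List[float]]:
    """Per-level accuracy change (percentage points) from enabling thinking."""
    return {
        model: [round(on - off, 2) for on, off in zip(with_t, without_t)]
        for model, (with_t, without_t) in table.items()
    }


for model, deltas in thinking_deltas(THINKING).items():
    print(model, deltas)
```

The largest differences fall on Level-2, while the Mimo-VL Level-3 difference is slightly negative.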
---
## Impact of Image Resolution (GPT-5)
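Accuracy (%) at different input resolutions. All levels improve from 512×512 to 1024×1024; at 2048×2048, Level-1 and Level-3 improve further, while Level-2 dips slightly (49.61 to 48.99).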
<table>
<thead>
<tr>
<th>Resolution</th>
<th>Level-1</th>
<th>Level-2</th>
<th>Level-3</th>
</tr>
</thead>
<tbody>
<tr><td>512×512</td><td>45.17</td><td>42.83</td><td>31.63</td></tr>
<tr><td>1024×1024</td><td>54.01</td><td>49.61</td><td>35.48</td></tr>
<tr><td>2048×2048</td><td>56.51</td><td>48.99</td><td>40.07</td></tr>
</tbody>
</table>
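A minimal sketch for checking, per level, whether accuracy increases at every resolution step, using the figures from the table above. The names `RESOLUTION` and `strictly_increasing` are illustrative.

```python
from typing import Dict, List

# level -> accuracy (%) at 512x512, 1024x1024, 2048x2048, from the table.
RESOLUTION: Dict[str, List[float]] = {
    "Level-1": [45.17, 54.01, 56.51],
    "Level-2": [42.83, 49.61, 48.99],
    "Level-3": [31.63, 35.48, 40.07],
}


def strictly_increasing(values: List[float]) -> bool:
    """True if each value is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


for level, scores in RESOLUTION.items():
    steps = [round(b - a, 2) for a, b in zip(scores, scores[1:])]
    print(level, "steps:", steps, "monotonic:", strictly_increasing(scores))
```

Level-1 and Level-3 increase at both steps; Level-2 gains 6.78 points at the first step and loses 0.62 at the second.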
---
## Citation
```bibtex
@article{visres2025,
title={VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs},
author={Malagurski T{\"o}rtei, Brigitta and Dahou, Yasser and Huynh, Ngoc Dung and others},
journal={arXiv preprint arXiv:2512.21194},
year={2025}
}
```
Provider:
tiiuae



