tiiuae/visres_bench

Hugging Face · Updated 2026-03-10 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/tiiuae/visres_bench
Resource overview:
--- license: apache-2.0 task_categories: - visual-question-answering - image-to-text language: - en tags: - benchmark - vision - reasoning - multimodal - evaluation pretty_name: VisRes-Bench dataset_info: - config_name: level_1_global_occlusion_50 features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_global_occlusion_70 features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_global_occlusion_80 features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_edges features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_location_random_sampling features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_brightness features: - name: id dtype: string - name: task dtype: string - 
name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_blur features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_rotation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_rotation_random_sampling features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_edges_random_sampling features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_1_location features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 1000 - config_name: level_2_uniform_count features: - name: id dtype: string - name: task 
dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_count_progression features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_uniform_orientation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 458 - config_name: level_2_count_2_same_1_diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_orientation_2same_1diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 498 - config_name: level_2_uniform_color features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_count_arithmetic features: - 
name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_count_minmax features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_orientation_3_diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_color_2same_1diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_color_3_diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_2_count_3_diff features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: 
level_3_spiral_color_orientation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 350 - config_name: level_3_spiral_color_orientation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 464 - config_name: level_3_coupled_color_count features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 500 - config_name: level_3_independent_color_object_orientation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 355 - config_name: level_3_coupled_color_orientation features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: string - name: answer dtype: string splits: - name: test num_examples: 374 - config_name: level_3_Independent_count_object_color features: - name: id dtype: string - name: task dtype: string - name: level dtype: string - name: guided_question dtype: string - name: generic_question dtype: string - name: images sequence: image - name: question dtype: 
string - name: answer dtype: string splits: - name: test num_examples: 479
configs:
- config_name: level_1_global_occlusion_50
  data_files:
  - split: test
    path: level_1_global_occlusion_50percent/test-*
- config_name: level_1_global_occlusion_70
  data_files:
  - split: test
    path: level_1_global_occlusion_70percent/test-*
- config_name: level_1_global_occlusion_80
  data_files:
  - split: test
    path: level_1_global_occlusion_80percent/test-*
- config_name: level_1_edges
  data_files:
  - split: test
    path: level_1_edges_eval_6k_location_only_dino_mode_options/test-*
- config_name: level_1_location_random_sampling
  data_files:
  - split: test
    path: level_1_eval_6k_location_only_random_sampling/test-*
- config_name: level_1_brightness
  data_files:
  - split: test
    path: level_1_eval_6k_brightness_dino_options/test-*
- config_name: level_1_blur
  data_files:
  - split: test
    path: level_1_eval_6k_blur_dino_options/test-*
- config_name: level_1_rotation
  data_files:
  - split: test
    path: level_1_eval_6k_rotation_direct_dino_options/test-*
- config_name: level_1_rotation_random_sampling
  data_files:
  - split: test
    path: level_1_eval_6k_single_rotation_same_options/test-*
- config_name: level_1_edges_random_sampling
  data_files:
  - split: test
    path: level_1_edges_eval_6k_location_only_random_sampling/test-*
- config_name: level_1_location
  data_files:
  - split: test
    path: level_1_eval_6k_location_only_dino_mode_options/test-*
- config_name: level_2_uniform_count
  data_files:
  - split: test
    path: level_2_count_only/test-*
- config_name: level_2_count_progression
  data_files:
  - split: test
    path: level_2_count_progression_mixed/test-*
- config_name: level_2_uniform_orientation
  data_files:
  - split: test
    path: level_2_orientation_only/test-*
- config_name: level_2_count_2_same_1_diff
  data_files:
  - split: test
    path: level_2_count_distribution_2same_1diff/test-*
- config_name: level_2_orientation_2same_1diff
  data_files:
  - split: test
    path: level_2_orientation_distribution_2same_1diff/test-*
- config_name: level_2_uniform_color
  data_files:
  - split: test
    path: level_2_color_only/test-*
- config_name: level_2_count_arithmetic
  data_files:
  - split: test
    path: level_2_count_operations/test-*
- config_name: level_2_count_minmax
  data_files:
  - split: test
    path: level_2_count_minmax/test-*
- config_name: level_2_orientation_3_diff
  data_files:
  - split: test
    path: level_2_orientation_distribution/test-*
- config_name: level_2_color_2same_1diff
  data_files:
  - split: test
    path: level_2_color_distribution_2same_1diff/test-*
- config_name: level_2_color_3_diff
  data_files:
  - split: test
    path: level_2_color_distribution/test-*
- config_name: level_2_count_3_diff
  data_files:
  - split: test
    path: level_2_count_distribution/test-*
- config_name: level_3_spiral_color_orientation
  data_files:
  - split: test
    path: level_3_compositional_spiral_orientation/test-*
- config_name: level_3_spiral_color_orientation
  data_files:
  - split: test
    path: level_3_compositional_spiral_object_color/test-*
- config_name: level_3_coupled_color_count
  data_files:
  - split: test
    path: level_3_coupled_count_color/test-*
- config_name: level_3_independent_color_object_orientation
  data_files:
  - split: test
    path: level_3_independent_color_object_orientation/test-*
- config_name: level_3_coupled_color_orientation
  data_files:
  - split: test
    path: level_3_coupled_orientation_color/test-*
- config_name: level_3_Independent_count_object_color
  data_files:
  - split: test
    path: level_3_independent_distribution_arithmetic_object/test-*
---

# VisRes Bench

[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://visres-bench.github.io/) [![arXiv](https://img.shields.io/badge/arXiv-2512.21194-b31b1b?style=flat-square&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.21194)

**VisRes Bench** is a benchmark for evaluating the **visual reasoning** capabilities of Vision-Language Models (VLMs) in naturalistic settings without contextual language supervision.
It is introduced in the paper [*VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs*](https://arxiv.org/abs/2512.21194).

## Paper Summary

Vision-Language Models excel at captioning and VQA, but it is unclear how much they rely on **visual reasoning** versus **linguistic priors**. VisRes addresses this by using **image-only, four-choice** tasks on **real-world images** (~19,000 samples) so that performance reflects visual reasoning rather than textual shortcuts.

The benchmark is organized in **three levels** of increasing complexity:

- **Level 1 (Perceptual grounding):** Local patch completion (masked tile + 4 candidate patches) under perturbations (blur, brightness, rotation, edges, location) and global occlusion (50% or 80% of the image masked). Tests robustness and amodal completion.
- **Level 2 (Single-attribute rule):** Raven-style 3×3 grids with one missing cell; one attribute (color, count, or orientation) follows a row-wise rule. Includes uniform, 3-different, 2-similar-1-different, count progression, arithmetic, and min-max subtasks (~5,956 samples).
- **Level 3 (Multi-attribute composition):** Same 3×3 format but multiple attributes (color, count, orientation, object identity) with row-wise, grid-wise, or spiral rules (~2,522 samples).

**Main findings:** State-of-the-art VLMs perform near **random (25%)** on many subtasks under subtle perceptual changes. Performance is stronger on color than count, and weakest on orientation. When the same logical structure is given as **text**, models do much better, indicating a **visual-to-symbolic** bottleneck rather than a pure reasoning limit. Higher resolution and guided/thinking prompts help but do not close the gap to human baselines.

---

## Main Results (Guided Prompting, Thinking Mode When Available)

Accuracy (%) across levels and subtasks. Random chance = 25%.
<table>
<thead>
<tr><th>Setting</th><th>GPT-5</th><th>GPT-4o</th><th>Gemini-2.5</th><th>Qwen3-VL-4B</th><th>Qwen3-VL-30B</th><th>Mimo-VL-7B</th></tr>
</thead>
<tbody>
<tr><td colspan="7"><strong>Level-1</strong></td></tr>
<tr><td>Edges</td><td>27.17</td><td>23.91</td><td>25.00</td><td>16.67</td><td>25.00</td><td>22.30</td></tr>
<tr><td>Location</td><td>23.71</td><td>20.62</td><td>26.00</td><td>23.16</td><td>22.40</td><td>25.77</td></tr>
<tr><td>Rotation</td><td>35.42</td><td>26.04</td><td>34.38</td><td>37.50</td><td>36.05</td><td>29.17</td></tr>
<tr><td>Brightness</td><td>25.26</td><td>27.37</td><td>27.37</td><td>31.52</td><td>29.47</td><td>27.37</td></tr>
<tr><td>Blur</td><td>31.18</td><td>25.26</td><td>26.32</td><td>24.73</td><td>24.28</td><td>26.32</td></tr>
<tr><td>Global@50%</td><td>42.86</td><td>20.88</td><td>57.14</td><td>37.50</td><td>47.25</td><td>48.35</td></tr>
<tr><td>Global@80%</td><td>32.61</td><td>22.83</td><td>36.96</td><td>25.88</td><td>35.87</td><td>30.43</td></tr>
<tr><td><strong>Level-1 Average</strong></td><td><strong>31.10</strong></td><td><strong>23.86</strong></td><td><strong>33.28</strong></td><td><strong>28.17</strong></td><td><strong>31.20</strong></td><td><strong>29.22</strong></td></tr>
<tr><td colspan="7"><strong>Level-2</strong></td></tr>
<tr><td>Uniform Color</td><td>96.00</td><td>21.00</td><td>97.00</td><td>66.20</td><td>88.00</td><td>78.95</td></tr>
<tr><td>Uniform Count</td><td>61.00</td><td>25.00</td><td>90.91</td><td>40.82</td><td>59.00</td><td>52.75</td></tr>
<tr><td>Uniform Orientation</td><td>22.22</td><td>25.25</td><td>26.53</td><td>26.00</td><td>23.00</td><td>19.19</td></tr>
<tr><td>Count Progression</td><td>50.00</td><td>13.00</td><td>77.00</td><td>37.20</td><td>48.00</td><td>36.96</td></tr>
<tr><td>Count Arithmetic</td><td>52.00</td><td>22.00</td><td>75.76</td><td>43.20</td><td>49.00</td><td>33.33</td></tr>
<tr><td><strong>Level-2 Average</strong></td><td><strong>49.79</strong></td><td><strong>24.12</strong></td><td><strong>62.29</strong></td><td><strong>37.18</strong></td><td><strong>46.75</strong></td><td><strong>39.15</strong></td></tr>
<tr><td colspan="7"><strong>Level-3</strong></td></tr>
<tr><td>Independent Color-Object-Orientation</td><td>34.00</td><td>25.25</td><td>38.00</td><td>27.39</td><td>32.60</td><td>19.00</td></tr>
<tr><td>Independent Count-Object-Color</td><td>34.00</td><td>24.00</td><td>44.00</td><td>29.45</td><td>36.34</td><td>29.00</td></tr>
<tr><td>Coupled Color-Orientation</td><td>24.24</td><td>24.00</td><td>16.33</td><td>26.13</td><td>29.43</td><td>20.00</td></tr>
<tr><td>Coupled Color-Count</td><td>30.00</td><td>22.00</td><td>21.21</td><td>27.46</td><td>33.33</td><td>28.00</td></tr>
<tr><td>Spiral Color-Count-Object</td><td>56.00</td><td>30.00</td><td>54.17</td><td>28.63</td><td>36.00</td><td>33.00</td></tr>
<tr><td><strong>Level-3 Average</strong></td><td><strong>34.39</strong></td><td><strong>23.86</strong></td><td><strong>33.73</strong></td><td><strong>26.31</strong></td><td><strong>31.36</strong></td><td><strong>25.17</strong></td></tr>
</tbody>
</table>

---

## Finetuning on Level-1 (Qwen2.5-VL-3B)

Accuracy (%) before and after finetuning, compared with the human baseline.

<table>
<thead>
<tr><th>Setting</th><th>Original</th><th>Finetuned</th><th>Human Baseline</th></tr>
</thead>
<tbody>
<tr><td>Location</td><td>24.3</td><td>42.8</td><td>94.1</td></tr>
<tr><td>Blur</td><td>23.9</td><td>37.5</td><td>84.3</td></tr>
<tr><td>Brightness</td><td>23.7</td><td>39.8</td><td>85.6</td></tr>
<tr><td>Rotation</td><td>25.5</td><td>50.8</td><td>92.0</td></tr>
<tr><td>Edges</td><td>25.1</td><td>33.2</td><td>82.6</td></tr>
<tr><td>Global (50%)</td><td>24.9</td><td>52.2</td><td>96.1</td></tr>
<tr><td>Global (80%)</td><td>23.9</td><td>38.6</td><td>98.0</td></tr>
<tr><td><strong>Average</strong></td><td><strong>24.5</strong></td><td><strong>43.7</strong></td><td><strong>90.4</strong></td></tr>
</tbody>
</table>

---

## Single-Attribute Recognition (Perceptual Grounding)

Accuracy (%) when models are asked to report a single attribute (color, orientation, or count) for one grid cell.

<table>
<thead>
<tr><th>Attribute</th><th>GPT-4o</th><th>GPT-5</th></tr>
</thead>
<tbody>
<tr><td>Color</td><td>84.6</td><td>97.6</td></tr>
<tr><td>Orientation</td><td>39.8</td><td>49.6</td></tr>
<tr><td>Count</td><td>72.4</td><td>94.2</td></tr>
</tbody>
</table>

---

## Impact of Thinking Mode

Accuracy (%) with thinking mode enabled (✓) vs disabled (✗). Open-source models generally improve with thinking, most clearly on Level-1 and Level-2; Mimo-VL is essentially unchanged on Level-3.

<table>
<thead>
<tr><th>Level</th><th>GPT-5 (high)</th><th>GPT-5 (low)</th><th>Mimo-VL ✓</th><th>Mimo-VL ✗</th><th>Qwen3-4B ✓</th><th>Qwen3-4B ✗</th><th>Qwen3-30B ✓</th><th>Qwen3-30B ✗</th></tr>
</thead>
<tbody>
<tr><td>Level-1</td><td>32.61</td><td>31.43</td><td>29.22</td><td>23.91</td><td>28.17</td><td>23.16</td><td>31.20</td><td>23.60</td></tr>
<tr><td>Level-2</td><td>49.79</td><td>47.01</td><td>39.15</td><td>26.68</td><td>37.18</td><td>24.08</td><td>46.75</td><td>28.25</td></tr>
<tr><td>Level-3</td><td>34.39</td><td>32.89</td><td>25.17</td><td>25.23</td><td>26.31</td><td>23.50</td><td>31.36</td><td>24.00</td></tr>
</tbody>
</table>

---

## Impact of Image Resolution (GPT-5)

Accuracy (%) at different input resolutions. Accuracy improves from 512×512 to 1024×1024 at every level; at 2048×2048, Level-1 and Level-3 improve further while Level-2 dips slightly.

<table>
<thead>
<tr><th>Resolution</th><th>Level-1</th><th>Level-2</th><th>Level-3</th></tr>
</thead>
<tbody>
<tr><td>512×512</td><td>45.17</td><td>42.83</td><td>31.63</td></tr>
<tr><td>1024×1024</td><td>54.01</td><td>49.61</td><td>35.48</td></tr>
<tr><td>2048×2048</td><td>56.51</td><td>48.99</td><td>40.07</td></tr>
</tbody>
</table>

---

## Citation

```bibtex
@article{visres2025,
  title={VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs},
  author={Malagurski T{\"o}rtei, Brigitta and Dahou, Yasser and Huynh, Ngoc Dung and others},
  journal={arXiv preprint arXiv:2512.21194},
  year={2025}
}
```
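
## Example: Loading and Scoring (Sketch)

The card lists per-subtask configurations and reports four-choice accuracy, so a small loading and scoring sketch may be useful. The snippet below is a minimal illustration, not the authors' evaluation code. It assumes the Hugging Face `datasets` library is installed and that the `answer` field holds the label of the correct option; the card does not specify the answer format, so the string normalization is an assumption. `load_visres_config` and `four_choice_accuracy` are hypothetical helper names introduced here.

```python
from typing import Iterable


def load_visres_config(config_name: str = "level_1_blur"):
    """Load the test split of one VisRes Bench config (requires network access)."""
    # Imported lazily so the scoring helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("tiiuae/visres_bench", config_name, split="test")


def four_choice_accuracy(predictions: Iterable[str], answers: Iterable[str]) -> float:
    """Return accuracy in percent after trimming whitespace and ignoring case.

    Assumption: predictions and answers are option labels given as strings.
    """
    pairs = list(zip(predictions, answers))
    if not pairs:
        raise ValueError("no predictions to score")
    correct = sum(p.strip().lower() == a.strip().lower() for p, a in pairs)
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    # Toy labels only; a real run would compare model outputs with the `answer` column.
    print(four_choice_accuracy(["A", "b ", "C", "D"], ["A", "B", "D", "D"]))
```

Only `four_choice_accuracy` runs offline. With four options per question, a score near 25% corresponds to the random-chance level quoted in the results tables.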
Provider:
tiiuae