
UltraVR: Where Do Vision-Language Models Fail in Ultra-Resolution Visual Reasoning?

DataCite Commons · Updated 2026-05-06 · Indexed 2026-05-18
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/3EW3RA
Description:
Vision-language models (VLMs) have advanced rapidly on visual question answering and multimodal reasoning benchmarks. Yet it remains unclear whether they can reason over ultra-resolution images, where answer-critical evidence may be tiny, subtle, spatially distant, or distributed across regions. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: closed-circuit television (CCTV) surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges, including fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond image-question-answer triples, each UltraVR instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and operation labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis rather than black-box answer scoring. Using UltraVR, we benchmark frontier proprietary and open-weight VLMs and find that even the strongest model reaches only about 44% accuracy, revealing a substantial gap in ultra-resolution visual reasoning. This gap is not reliably closed by simple remedies such as local cropping or step-by-step prompting. Leveraging UltraVR's structured annotations, we find that correcting intermediate visual facts raises 49% final-answer accuracy, suggesting that downstream inference is often recoverable once the necessary visual facts are correctly supplied. Finally, operation-level analysis localizes the bottleneck to early evidence acquisition and local perception, revealing that current VLMs often fail not because they cannot reason from visual facts, but because they cannot reliably find and perceive those facts in ultra-resolution images.
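The instance structure and operation-level analysis described above can be sketched roughly as follows. This is an illustrative sketch only: the class names, field names, and the exact-match scoring rule are assumptions made for the example and are not taken from the released dataset, whose actual schema and scoring may differ. Only the five operation labels and the four domain abbreviations come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    # The five operation labels named in the description.
    EVIDENCE_GROUNDING = "evidence grounding"
    LOCAL_PERCEPTION = "local perception"
    QUANTIFICATION = "quantification"
    EVIDENCE_INTEGRATION = "evidence integration"
    DECISION_INFERENCE = "decision inference"


@dataclass
class Step:
    # One step of the ground-truth chain of thought (hypothetical field names).
    question: str
    answer: str
    operation: Operation


@dataclass
class Instance:
    # An image-question-answer triple plus its annotated steps (hypothetical layout).
    domain: str  # one of "CCTV", "RS", "WSI", "AD"
    question: str
    answer: str
    steps: list[Step]


def per_operation_accuracy(instances, predictions):
    """Group step-level correctness by operation label.

    predictions[i][j] is the model's answer to step j of instances[i].
    Scoring is case-insensitive exact match, which is an assumption.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, preds in zip(instances, predictions):
        for step, pred in zip(inst.steps, preds):
            total[step.operation] += 1
            if pred.strip().lower() == step.answer.strip().lower():
                correct[step.operation] += 1
    return {op: correct[op] / total[op] for op in total}
```

Grouping correctness by operation label rather than by final answer is what allows a failure to be attributed to a specific stage, such as evidence grounding, instead of to the reasoning chain as a whole.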
Provider:
Harvard Dataverse
Created:
2026-05-06