Irrelevance Robust Visual Question Answering (IR-VQA) (v2)
Source: IEEE (indexed 2026-04-17)
Download link:
https://ieee-dataport.org/documents/irrelevance-robust-visual-question-answering-ir-vqa-v2
Description:
Large Vision-Language Models (LVLMs) suffer from "multimodal distractibility," where plausible but irrelevant visual or textual inputs cause significant drops in reasoning consistency and lead to unreliable outputs. This paper introduces a comprehensive framework to systematically diagnose, evaluate, and mitigate this challenge. We present three core components: the large-scale IR-VQA benchmark, which surfaces these vulnerabilities across four paradigms; two diagnostic metrics, Positive Consistency (PC) and Negative Consistency (NC), which move beyond standard accuracy to measure the stability of a model's reasoning; and Relevance-Gated Multimodal Routing (RGMR), a lightweight module that dynamically filters distractions at inference time. Our experiments reveal that state-of-the-art models exhibit significant drops in consistency on IR-VQA. We demonstrate that finetuning on IR-VQA and deploying RGMR substantially improve model robustness where standard prompting fails. Our analysis of model behavior under different types of distractions, and of the underlying reasoning failures, provides a clear path toward more reliable multimodal systems.
Provided by:
Jinhui Yang; Qi Zhao; Ming Jiang



