Relation-Associated Instructions & Hallucination Benchmark
Source: ieee-dataport.org (indexed 2025-01-22)
Download link:
https://ieee-dataport.org/documents/relation-associated-instructions-hallucination-benchmark
Description:
Large vision-language models (LVLMs) suffer from hallucination, occasionally generating responses that plainly contradict the image content. The key problem lies in their weak ability to comprehend detailed content in multi-modal contexts, which can be mainly attributed to their training data. Vision instruction datasets primarily focus on global descriptions that are highly relevant to the image, with few samples containing image details. Therefore, we construct a fine-grained vision instruction dataset, RAI-30k, by generating image-text pairs with detailed relationship annotations from the Panoptic Scene Graph (PSG) dataset. These conversations pay more attention to detailed facts in the image, encouraging the model to answer questions based on multi-modal contexts. Moreover, to provide a deeper evaluation of hallucination in LVLMs, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three types, in which the response contradicts the image with wrong categories, attributes, or relations, and introduces the False Positive Rate as a detailed sub-metric for each type. We hope the provided dataset and benchmark will benefit future research on large vision-language models.
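The per-type False Positive Rate mentioned in the description can be illustrated with a minimal sketch. It assumes, hypothetically, that each benchmark item is a yes/no question tagged with one of the three hallucination types, and that a false positive is a "yes" answer to a question whose ground truth is "no". The record layout and the function name `false_positive_rate_by_type` are illustrative only and are not taken from the RAH-Bench release.

```python
from collections import defaultdict

def false_positive_rate_by_type(records):
    """Compute FPR = FP / (FP + TN) separately for each hallucination type.

    Each record is a hypothetical (hallucination_type, ground_truth, answer)
    tuple, where ground_truth and answer are "yes" or "no".
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for hallucination_type, ground_truth, answer in records:
        if ground_truth != "no":
            continue  # only ground-truth negatives count toward FPR
        negatives[hallucination_type] += 1
        if answer == "yes":
            false_positives[hallucination_type] += 1
    return {t: false_positives[t] / negatives[t] for t in negatives}

# Toy records (made up for illustration, not benchmark data).
records = [
    ("category", "no", "yes"),
    ("category", "no", "no"),
    ("attribute", "no", "no"),
    ("relation", "no", "yes"),
    ("relation", "yes", "yes"),  # ground-truth positive, ignored
]
rates = false_positive_rate_by_type(records)
```

Under these assumptions, a higher rate for one type would indicate that the model more often affirms statements that contradict the image in that particular way.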
Provider:
IEEE Dataport



