PGPL: enhancing spatial awareness abilities of multimodal large language models based on precise geometric position learning
Source: 中国科学数据 | Updated: 2025-10-10 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1007/s11432-024-4416-8
Description:
Multimodal large language models (MLLMs) are already used in visual question answering (VQA), autonomous driving, and smart healthcare, and show great application potential. However, existing MLLMs lag significantly behind human intelligence on spatial awareness tasks, especially in accurately identifying and interpreting complex spatial relationships between target entities. This deficiency severely impacts the accuracy of VQA, the safety of autonomous driving, and the reliability of smart healthcare. To meet the accuracy requirements for spatial relationship recognition in specific applications, we propose a novel framework named PGPL, which aims to enhance the spatial awareness of an MLLM by integrating precise geometric position information between target entities into the MLLM, without additional training of the MLLM. Specifically, PGPL leverages a spatial position generation model and a scene graph generation model to obtain the geometric absolute position and geometric relative position of the target entities in the visual input. It then introduces a multidimensional information fusion strategy to guide the MLLM to accurately answer questions related to spatial awareness. Quantitative experiments on six popular datasets and twelve MLLMs, together with related qualitative experiments, demonstrate the importance of precise geometric position information for correctly answering spatial awareness questions, as well as the superiority of the PGPL framework.
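The general idea in the description (derive absolute and relative positions of entities, then supply them to an MLLM alongside the question, with no retraining) can be illustrated with a minimal sketch. This is not the authors' implementation: this record does not specify the spatial position model, the scene graph model, or the fusion strategy. The sketch assumes absolute positions arrive as pixel bounding boxes, replaces the scene graph model with a simple comparison of box centers, and uses plain text concatenation as a stand-in for fusion. The names `Entity`, `relative_position`, and `build_prompt` are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for one detected entity."""
    name: str
    box: tuple  # assumed format: (x_min, y_min, x_max, y_max), origin at top-left

    @property
    def center(self):
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2, (y0 + y1) / 2)


def relative_position(a, b):
    """Describe where `a` lies relative to `b`, based on box centers only.

    A crude stand-in for a scene graph model: the dominant axis of the
    center offset decides the relation. Ties fall back to the horizontal axis.
    """
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    # Image y grows downward, so a smaller y means higher in the image.
    return "above" if dy < 0 else "below"


def build_prompt(entities, question):
    """Prepend absolute and relative position text to the question."""
    lines = ["Entity positions (x_min, y_min, x_max, y_max):"]
    lines += [f"- {e.name}: {e.box}" for e in entities]
    lines.append("Pairwise relations:")
    lines += [
        f"- {a.name} is {relative_position(a, b)} {b.name}"
        for a, b in permutations(entities, 2)
    ]
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Calling `build_prompt([Entity("cup", (10, 50, 30, 70)), Entity("laptop", (100, 40, 200, 120))], "Where is the cup?")` yields a text prompt that could be passed to any MLLM together with the image. Because the position information travels in the prompt, the model's weights are left untouched, which matches the training-free property the description claims.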
Created:
2025-04-29



