Multimodal 3D object detection for autonomous driving under vision-language supervision: a contrastive-learning perspective
Source: China Scientific Data | Updated: 2026-04-16 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1007/s11432-025-4853-5
Resource description:
Multimodal large language models (MLLMs) are widely regarded as generalists across a broad spectrum of vision-language understanding tasks. Despite notable advances, their potential for autonomous-driving perception remains largely underexplored. In response, we conduct an in-depth investigation of image-text-point interaction and propose a versatile paradigm of vision-language supervision (VLS) for 3D object detection, in which multi-sensory proposals are first refined with carefully designed text-referred expressions, and multimodal correspondences are then incorporated in a contrastive-learning manner. VLS offers three advantages. (1) No complicated engineering. It can be seamlessly integrated into a camera-LiDAR 3D detector without troublesome hand-crafted engineering. (2) No extra computation. It provides auxiliary guidance only during training. (3) No additional data. It derives multimodal pairs from ground-truth labels instead of a laborious annotation pipeline. Experiments on the publicly available KITTI and nuScenes benchmarks demonstrate state-of-the-art detection performance against a wide range of counterparts, indicating the effectiveness of the approach. We hope this work can pave a substantial path towards multimodal feature fusion and object detection for autonomous driving.
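The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE loss, under the assumption that each refined proposal feature is paired with the embedding of its text description. The function names, the temperature value of 0.07, and the toy feature vectors are all illustrative assumptions and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(proposal_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical stand-in for the VLS objective).

    proposal_feats[i] and text_feats[i] form a positive pair; every other
    combination in the batch is treated as a negative.
    """
    assert proposal_feats and len(proposal_feats) == len(text_feats)
    p = [l2_normalize(v) for v in proposal_feats]
    t = [l2_normalize(v) for v in text_feats]
    n = len(p)

    # Cosine-similarity logits: rows index proposals, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(p[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Proposal-to-text direction: the matching text sits on the diagonal.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-proposal direction: same computation on the transposed logits.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2p = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (loss_p2t + loss_t2p)


if __name__ == "__main__":
    proposals = [[1.0, 0.0], [0.0, 1.0]]   # toy proposal features
    texts = [[0.9, 0.1], [0.2, 0.8]]       # toy text embeddings
    print(symmetric_info_nce(proposals, texts))
```

Because such a loss is an auxiliary term computed only from training-time features, dropping it at inference leaves the detector's forward pass unchanged, which is consistent with the "no extra computation" claim in the abstract.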
Created:
2026-03-11



