Multimodal 3D object detection for autonomous driving under vision-language supervision: a contrastive-learning perspective

China Scientific Data (中国科学数据) · Updated 2026-04-16 · Indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.1007/s11432-025-4853-5


Resource description:

Multimodal large language models (MLLMs) are widely acknowledged as generalists across a broad spectrum of vision-language understanding tasks. Despite notable advances, their potential for autonomous-driving perception remains largely underexplored. In response, we conduct an in-depth investigation of image-text-point interaction and propose a versatile paradigm of vision-language supervision (VLS) for 3D object detection, in which multi-sensory proposals are primarily refined with carefully designed text-referred expressions, and multimodal correspondences are further incorporated in a contrastive-learning manner. VLS offers three advantages. (1) No complicated engineering: it can be seamlessly integrated into a camera-LiDAR 3D detector without hand-crafted engineering. (2) No extra computation: it provides auxiliary guidance only during training. (3) No additional data: it derives multimodal pairs from ground-truth labels instead of a laborious annotation pipeline. Empirical studies on the publicly available KITTI and nuScenes benchmarks demonstrate state-of-the-art detection performance against a wide range of counterparts, suggesting its effectiveness. We hope this work paves the way towards multimodal feature fusion and object detection for autonomous driving.
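The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss between proposal features and text embeddings. The function names, the temperature value, and the text template in `describe_box` are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(proposal_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    This is a generic contrastive objective, not necessarily the one used by VLS.
    """
    if not proposal_feats or len(proposal_feats) != len(text_feats):
        raise ValueError("inputs must be non-empty and of equal length")
    props = [l2_normalize(v) for v in proposal_feats]
    texts = [l2_normalize(v) for v in text_feats]
    n = len(props)
    # Cosine similarity between every proposal and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(props[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Proposal-to-text direction: the matching text sits on the diagonal.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-proposal direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2t + loss_t2p)


def describe_box(label):
    """Hypothetical template that turns a ground-truth label into a text expression."""
    return "a {category} about {distance:.0f} meters away".format(**label)
```

Because a loss of this kind is evaluated only during training, it is consistent with the abstract's claim of no extra inference cost. One caveat of this simple form: two objects of the same class in one batch are treated as negatives for each other, which a real system would likely need to handle.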

Created:

2026-03-11
