M3Fusion: a unified multi-view multi-modality fusion framework for embodied 3D perception

Source: 中国科学数据 (China Scientific Data) · Updated 2026-02-05 · Indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.1360/SSI-2025-0293


Resource description:

Embodied 3D perception requires understanding dynamic environments from ego-centric viewpoints using natural language instructions. However, current methods that leverage large language models (LLMs) for embodied 3D perception remain limited. Some approaches suffer from restricted semantic output and localization accuracy, while other LLM-based approaches lack a unified encoder for effectively aggregating multi-view semantic and geometric features, which are essential for precise language alignment and localization. To address this, we introduce M3Fusion, the first end-to-end framework for unified multi-view multi-modality fusion in embodied 3D perception. M3Fusion tightly integrates 2D visual semantics and 3D geometric features from multiple ego-centric views, projecting them into a shared 3D space to form unified M3-tokens. These tokens enable seamless alignment with language for complex instruction understanding and are simultaneously decoded into accurate 3D bounding boxes. We propose a specialized three-stage training strategy to align the modalities. Evaluations on 3D visual grounding and QA datasets demonstrate significant performance improvements, particularly in grounding accuracy, while maintaining QA capability, highlighting the effectiveness of our unified representation and framework design.
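The idea of projecting multi-view features into a shared 3D space can be illustrated with a minimal sketch. This is not the M3Fusion encoder, which the description above does not specify. It assumes per-pixel 2D features, a depth map, pinhole intrinsics, and a camera-to-world pose for each view, and it substitutes plain voxel average pooling for the learned fusion. The function names `backproject` and `fuse_views` and the `voxel_size` parameter are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of one view to world-space points, shape (H*W, 3)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                      # pixel row (v) and column (u) indices
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]    # pinhole model: x = (u - cx) * z / fx
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]    # pinhole model: y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]

def fuse_views(feats, depths, Ks, poses, voxel_size=0.5):
    """Average per-pixel features from all views inside each occupied voxel.

    Assumption: simple mean pooling stands in for a learned fusion module.
    Returns (voxel_centers, tokens) with shapes (V, 3) and (V, C).
    """
    pts = np.concatenate([backproject(d, K, T) for d, K, T in zip(depths, Ks, poses)])
    f = np.concatenate([x.reshape(-1, x.shape[-1]) for x in feats])
    keep = np.concatenate([d.reshape(-1) for d in depths]) > 0   # drop invalid depth
    pts, f = pts[keep], f[keep]

    idx = np.floor(pts / voxel_size).astype(np.int64)            # integer voxel coordinates
    uniq, inv = np.unique(idx, axis=0, return_inverse=True)
    inv = inv.reshape(-1)                                        # flat index across NumPy versions

    tokens = np.zeros((len(uniq), f.shape[1]))
    np.add.at(tokens, inv, f)                                    # sum features per voxel
    tokens /= np.bincount(inv, minlength=len(uniq))[:, None]     # convert sum to mean
    centers = (uniq + 0.5) * voxel_size
    return centers, tokens
```

Each row of `tokens` is paired with a 3D location in `centers`, so points observed from different views that fall in the same voxel contribute to a single token. A real system would replace the mean with a learned aggregation and add geometric features alongside the 2D semantics.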

Created:

2026-01-19
