M3Fusion: a unified multi-view multi-modality fusion framework for embodied 3D perception

中国科学数据 (China Scientific Data) | Updated: 2026-02-05 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1360/SSI-2025-0293
Resource description:
Embodied 3D perception requires understanding dynamic environments from ego-centric viewpoints under natural language instructions. However, current methods that leverage large language models (LLMs) for embodied 3D perception remain limited. Some approaches suffer from restricted semantic output and limited localization accuracy, while other LLM-based approaches lack a unified encoder for effectively aggregating multi-view semantic and geometric features, which are essential for precise language alignment and localization. To address this, we introduce M3Fusion, the first end-to-end framework for unified multi-view multi-modality fusion in embodied 3D perception. M3Fusion tightly integrates 2D visual semantics and 3D geometric features from multiple ego-centric views, projecting them into a shared 3D space to form unified M3-tokens. These tokens enable seamless alignment with language for complex instruction understanding and are simultaneously decoded into accurate 3D bounding boxes. We propose a specialized 3-stage training strategy to align the modalities. Evaluations on 3D visual grounding and QA datasets demonstrate significant performance improvements, particularly in grounding accuracy, while maintaining QA capability, highlighting the effectiveness of our unified representation and framework design.
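The description above mentions projecting per-view 2D features into a shared 3D space. The sketch below illustrates only that general idea under strong assumptions and is not the M3Fusion implementation: it assumes each view supplies a depth map, a per-pixel feature map, pinhole intrinsics `K`, and a camera-to-world pose, and it substitutes simple voxel-wise averaging for whatever learned fusion the authors use. The names `backproject`, `fuse_views`, and `voxel_size` are hypothetical.

```python
import numpy as np

def backproject(depth, feats, K, cam_to_world):
    """Lift per-pixel features to world-space points (pinhole camera assumed)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth.reshape(-1).astype(float)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (H*W, 4) homogeneous
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    valid = z > 0  # drop pixels without a depth reading
    return pts_world[valid], feats.reshape(h * w, -1)[valid]

def fuse_views(views, voxel_size=0.1):
    """Average features from all views that fall into the same voxel.

    `views` is a list of (depth, feats, K, cam_to_world) tuples.
    Plain averaging is an illustrative stand-in, not the paper's method.
    """
    lifted = [backproject(*view) for view in views]
    pts = np.concatenate([p for p, _ in lifted])
    fts = np.concatenate([f for _, f in lifted]).astype(float)
    keys = np.floor(pts / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten for consistency across NumPy versions
    sums = np.zeros((len(uniq), fts.shape[1]))
    np.add.at(sums, inverse, fts)  # unbuffered scatter-add per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    centers = (uniq + 0.5) * voxel_size  # voxel centers in world coordinates
    return centers, sums / counts[:, None]
```

Each returned row pairs a voxel center with a fused feature vector, which is roughly the kind of spatially anchored token that a language model or box decoder could consume. How M3-tokens are actually built is not specified in this description.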
Created:
2026-01-19