Multimodal prior-augmented text-driven 3D human-object interaction generation
Source: 中国科学数据 | Updated: 2026-04-17 | Indexed: 2026-04-25
Download link:
https://www.sciengine.com/AA/doi/10.1007/s11432-025-4809-7
Resource description:
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects.
To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights.
(1) Multimodal data priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles (Q1) and (Q2) in data modeling.
(2) Enhanced object representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackle (Q2) in data representation.
(3) Multimodal-aware mixture-of-experts (MoE) model: We propose a modality-aware MoE model for an effective multimodal feature fusion paradigm, which tackles (Q1) and (Q2) in feature fusion.
(4) Cascaded diffusion with interaction supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles (Q3) in interaction refinement.
Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
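As a rough illustration of the general idea behind insight (3), the sketch below shows one common way a mixture-of-experts layer can fuse per-modality feature vectors: each modality has its own gate that produces softmax weights over a shared set of experts, and the gated expert outputs are averaged across modalities. This is not the MP-HOI implementation, which the abstract does not specify. The names `moe_fuse`, `linear`, and `gate_weights`, the use of plain linear experts, and the averaging step are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_fuse(features, experts, gate_weights):
    """Fuse per-modality feature vectors with modality-specific gating.

    features:     dict mapping modality name -> feature vector
    experts:      list of weight matrices shared by all modalities (hypothetical)
    gate_weights: dict mapping modality name -> gate matrix with one row
                  per expert (hypothetical; one gate per modality)
    """
    out_dim = len(experts[0])
    fused = [0.0] * out_dim
    for modality, x in features.items():
        # Modality-specific gate: a softmax distribution over the experts.
        gate = softmax(linear(gate_weights[modality], x))
        for g, expert in zip(gate, experts):
            expert_out = linear(expert, x)
            fused = [f + g * o for f, o in zip(fused, expert_out)]
    # Average the gated contributions across modalities (an assumption).
    return [f / len(features) for f in fused]
```

In this toy setup, `features` might hold one vector each for "text", "image", and "pose"; a learned model would train both the experts and the gates, which this sketch leaves as fixed inputs.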
Created:
2026-02-27



