Video Diffusion Models are Training-free Motion Interpreter and Controller

Name: Video Diffusion Models are Training-free Motion Interpreter and Controller
Creator: DR-NTU (Data)
Published: 2025-10-10 03:15:33
License: No description available

DataCite Commons | Updated 2025-10-10 | Indexed 2025-04-16

Download link:

https://researchdata.ntu.edu.sg/citation?persistentId=doi:10.21979/N9/HQM313


Resource description:

Video generation primarily aims to model authentic and customized motion across frames, making the understanding and control of motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for different models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, and thus lack interpretability and transparency regarding their effectiveness. To address this gap, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work reveals that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT), obtained by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.
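
The two steps named in the description (eliminating content correlation, then filtering motion channels with the help of PCA) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes, purely for illustration, that content correlation is removed by subtracting each location's mean over frames, and that motion channels are the ones with the largest absolute loading on the first principal component. The function name `extract_motion_feature` and the parameter `num_channels` are hypothetical.

```python
import numpy as np


def extract_motion_feature(features, num_channels=8):
    """Illustrative motion-feature extraction (hypothetical helper).

    features: array of shape (frames, height, width, channels), e.g. an
    intermediate feature map taken from a video diffusion model.
    Returns the content-removed features restricted to the selected
    channels, plus the sorted indices of those channels.
    """
    # Assumption: content shared by all frames is removed by subtracting
    # the temporal mean at every spatial location and channel.
    centered = features - features.mean(axis=0, keepdims=True)

    # Flatten to (samples, channels). Column means are already zero after
    # the temporal centering above, so PCA needs no further centering.
    flat = centered.reshape(-1, centered.shape[-1])

    # PCA via SVD: rows of vt are the principal directions in channel space.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)

    # Assumption: channels with the largest absolute loading on the first
    # principal component are treated as the motion channels.
    loadings = np.abs(vt[0])
    keep = np.sort(np.argsort(loadings)[-num_channels:])

    return centered[..., keep], keep
```

Calling `extract_motion_feature(feats, num_channels=2)` on a feature map of shape `(T, H, W, C)` returns an array of shape `(T, H, W, 2)` together with the two selected channel indices. How many channels to keep, and which principal components to inspect, are choices the description above does not specify.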


Provider:

DR-NTU (Data)

Created:

2024-10-11
