Video Diffusion Models are Training-free Motion Interpreter and Controller

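Summary

Video generation primarily aims to model authentic, customized motion across frames, which makes understanding and controlling motion a key topic. Most diffusion-based work on video motion focuses on motion customization under a training-based paradigm, which demands substantial training resources and requires retraining for each model. More importantly, these approaches do not examine how video diffusion models encode cross-frame motion information in their features, so their effectiveness lacks interpretability and transparency. To address this question, the paper introduces a new perspective for understanding, localizing, and manipulating motion-aware features in video diffusion models. An analysis based on Principal Component Analysis (PCA) shows that robust motion-aware features already exist in video diffusion models. We present a new MOtion FeaTure (MOFT), obtained by removing content correlation information and filtering motion channels. MOFT offers several distinct advantages: it encodes comprehensive motion information with clear interpretability, it can be extracted without training, and it generalizes across diverse architectures. Building on MOFT, we propose a new training-free framework for video motion control. The method shows competitive performance in generating natural and faithful motion, and it provides architecture-agnostic insights and applicability to a variety of downstream tasks.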
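The two MOFT steps named above (removing content correlation, then filtering motion channels) can be illustrated with a small sketch. The summary does not say how either step is computed, so everything below is an assumption made for illustration: content is taken to be the per-location mean over frames, motion channels are taken to be those with the largest absolute loading on the first principal component, and the function names `remove_content`, `first_principal_axis`, `motion_channels`, and `moft` are hypothetical. It is not the authors' implementation.

```python
import math

def remove_content(features):
    """Assumed content removal: subtract the mean over frames at each location.

    features[f][p][c] holds frame f, spatial location p, channel c.
    """
    n_frames, n_pos, n_ch = len(features), len(features[0]), len(features[0][0])
    mean = [[sum(features[f][p][c] for f in range(n_frames)) / n_frames
             for c in range(n_ch)] for p in range(n_pos)]
    return [[[features[f][p][c] - mean[p][c] for c in range(n_ch)]
             for p in range(n_pos)] for f in range(n_frames)]

def first_principal_axis(rows, iters=200):
    """Leading eigenvector of the covariance of `rows`, via power iteration."""
    n, d = len(rows), len(rows[0])
    col_mean = [sum(r[c] for r in rows) / n for c in range(d)]
    centered = [[r[c] - col_mean[c] for c in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    # Non-uniform start vector, to avoid starting orthogonal to the answer.
    v = [float(i + 1) for i in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def motion_channels(residual, k):
    """Assumed channel filter: top-k channels by |loading| on the first axis."""
    rows = [vec for frame in residual for vec in frame]
    axis = first_principal_axis(rows)
    ranked = sorted(range(len(axis)), key=lambda c: -abs(axis[c]))
    return sorted(ranked[:k])

def moft(features, k):
    """Content-removed features restricted to the selected motion channels."""
    residual = remove_content(features)
    keep = motion_channels(residual, k)
    return [[[vec[c] for c in keep] for vec in frame] for frame in residual]

# Toy input: 3 frames, 2 locations, 4 channels. Channels 0 and 3 are static
# "content"; channel 1 changes strongly over time; channel 2 changes weakly.
toy = [
    [[5.0, -2.0, -0.1, -3.0], [1.0, 2.0, -0.1, 7.0]],
    [[5.0, 0.0, 0.0, -3.0], [1.0, 0.0, 0.0, 7.0]],
    [[5.0, 2.0, 0.1, -3.0], [1.0, -2.0, 0.1, 7.0]],
]
result = moft(toy, k=1)
```

In this toy input the static channels vanish after the mean is subtracted, and the channel with the strongest temporal variation is the one kept. Real diffusion features would come from an intermediate layer of the model and would need a proper PCA routine rather than this hand-rolled power iteration.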

Summary (original)

Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demands substantial training resources and necessitates retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, lacking interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that robust motion-aware feature already exists in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.

arXiv information

Authors	Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan
Published	2024-11-01 12:46:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
