VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

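Abstract

Despite significant recent progress, generative video models struggle to capture real-world motion, dynamics, and physics. This limitation stems from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a new framework that instills an effective motion prior in video generators by having the model learn a joint appearance-motion representation. VideoJAM consists of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers generation toward coherent motion by using the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptation. VideoJAM achieves state-of-the-art motion coherence, surpassing highly competitive proprietary models while also improving the perceived visual quality of the generations. These results underscore that appearance and motion are complementary and, when effectively integrated, can improve both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/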

Abstract (original)

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior to video generators, by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model’s own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/

arXiv information

Authors	Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
Published	2025-02-04 17:07:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
