VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

Abstract

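Recent advances in text-to-video (T2V) diffusion models have enabled high-fidelity, realistic video synthesis.
However, current T2V models often struggle to generate physically plausible content because their inherent ability to understand physics accurately is limited.
We found that the representations within T2V models have some capacity for physics understanding, but lag significantly behind those of recent video self-supervised learning methods.
To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations.
This closes the physics understanding gap and enables more physically plausible generation.
Specifically, we introduce the Token Relation Distillation (TRD) loss, which leverages spatio-temporal alignment to provide soft guidance suitable for fine-tuning powerful pre-trained T2V models.
To our knowledge, VideoREPA is the first REPA method designed for fine-tuning T2V models, and specifically for injecting physical knowledge.
Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics.
More video results are available at https://videorepa.github.io/.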

Abstract (original)

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
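
The idea of aligning token-level relations, rather than the token features themselves, can be illustrated with a small sketch. The code below is not the authors' implementation: the choice of cosine similarity as the relation, the mean absolute difference as the distance, and all function names are assumptions made for illustration, since the abstract does not specify the exact form of the TRD loss. Tokens are assumed to be flattened over frames and spatial positions, so the pairwise relations cover both within-frame and cross-frame pairs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_matrix(tokens: Sequence[Vector]) -> List[List[float]]:
    """N x N matrix of pairwise cosine similarities between N tokens."""
    return [[cosine_similarity(a, b) for b in tokens] for a in tokens]


def token_relation_loss(
    student_tokens: Sequence[Vector], teacher_tokens: Sequence[Vector]
) -> float:
    """Mean absolute difference between student and teacher relation matrices.

    Hypothetical stand-in for a relation distillation loss (an assumption,
    not the formula from the paper).
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same number of tokens")
    n = len(student_tokens)
    if n == 0:
        raise ValueError("at least one token is required")
    student_rel = relation_matrix(student_tokens)
    teacher_rel = relation_matrix(teacher_tokens)
    total = sum(
        abs(student_rel[i][j] - teacher_rel[i][j])
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


# Teacher: two orthogonal tokens. Student A keeps them orthogonal (in a
# different feature dimension); student B collapses them onto one direction.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student_a = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student_b = [[1.0, 0.0], [1.0, 0.0]]

print(token_relation_loss(student_a, teacher))
print(token_relation_loss(student_b, teacher))
```

In this sketch the relation matrices are N x N regardless of feature width, so the student and teacher tokens do not need to share a feature dimension, and the loss constrains only how tokens relate to one another rather than their exact values.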

arXiv information

Authors	Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
Published	2025-05-29 17:06:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
