InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Abstract

Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but they focus on generating a single future state, which neglects the continuous dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics “simulators”, having learned interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines. Project page: https://interdyn.is.tue.mpg.de/

arXiv information

Authors	Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya
Published	2025-04-04 14:22:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
