InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

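Abstract

End-to-end human animation driven by rich multi-modal conditions such as text, images, and audio has advanced remarkably in recent years.
However, most existing methods can animate only a single subject and inject conditions globally, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions.
This global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications.
In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint.
Given reference images of multiple concepts, our method automatically infers layout information by using a mask predictor to match appearance cues between the denoised video and each reference appearance.
Furthermore, we iteratively inject each local audio condition into its corresponding region to ensure layout-aligned modality matching.
This design enables high-quality generation of controllable, multi-concept, human-centric videos.
Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared with implicit counterparts and other existing methods.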
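
The abstract's contrast between global and region-specific conditioning can be illustrated with a toy sketch. This is not the paper's implementation: the function name `inject_local_audio`, the scalar "audio features", and the hand-written masks are all assumptions made for illustration. In the described method the masks come from a learned mask predictor and the conditions are injected inside a video generation model; here each identity's audio value is simply added only where that identity's mask is non-zero.

```python
from typing import List

Grid = List[List[float]]  # a tiny 2D feature map, one float per location


def inject_local_audio(
    video_feat: Grid,
    masks: List[Grid],
    audio_feats: List[float],
) -> Grid:
    """Add each identity's audio feature only inside that identity's mask.

    Hypothetical helper: masks[k][y][x] is 1.0 where identity k is located
    and 0.0 elsewhere; audio_feats[k] is a scalar stand-in for identity k's
    audio condition.
    """
    height, width = len(video_feat), len(video_feat[0])
    out = [row[:] for row in video_feat]  # copy so the input is not mutated
    for mask, audio in zip(masks, audio_feats):
        for y in range(height):
            for x in range(width):
                out[y][x] += mask[y][x] * audio
    return out


# Two identities: A occupies the left half of the frame, B the right half.
video = [[0.0] * 4 for _ in range(2)]
mask_a = [[1.0, 1.0, 0.0, 0.0] for _ in range(2)]
mask_b = [[0.0, 0.0, 1.0, 1.0] for _ in range(2)]

result = inject_local_audio(video, [mask_a, mask_b], [1.0, -1.0])
print(result)
```

In this toy setting, A's audio value reaches only the left half and B's only the right half, whereas a global injection would add both values at every location. The iterative aspect mentioned in the abstract (layout inferred from the denoised video, then reused for injection) is not modeled here.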

Abstract (original)

End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio has achieved remarkable advancements in recent years. However, most existing methods could only animate a single subject and inject conditions in a global manner, ignoring scenarios that multiple concepts could appears in the same video with rich human-human interactions and human-object interactions. Such global assumption prevents precise and per-identity control of multiple concepts including humans and objects, therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity’s spatiotemporal footprint. Given reference images of multiple concepts, our method could automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio condition into its corresponding region to ensure layout-aligned modality matching in a iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.

arXiv information

Authors	Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin
Published	2025-06-11 17:57:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
