HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

Abstract

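HunyuanVideo-HOMA is a weakly conditioned, multimodal-driven framework that addresses key limitations in human-object interaction (HOI) video generation: reliance on curated motion data, limited generalization to novel objects and scenarios, and restricted accessibility.
HunyuanVideo-HOMA improves controllability and reduces dependence on precise inputs through sparse, decoupled motion guidance.
It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT) and fuses them within a shared context space to synthesize temporally consistent, physically plausible interactions.
To optimize training, it integrates a parameter-space HOI adapter initialized from pretrained MMDiT weights, which preserves prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization.
Extensive experiments confirm state-of-the-art performance in interaction naturalness and in generalization under weak supervision.
Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface.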
The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.

Abstract (original)

To address key limitations in human-object interaction (HOI) video generation — specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility — we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.
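
The abstract describes encoding appearance and motion signals into the dual input space of an MMDiT and fusing them in a shared context space. As a rough illustration only, the sketch below shows the general idea of joint attention over two token streams: each stream keeps its own projections, and a single attention pass runs over the concatenated tokens. It is a minimal single-head toy in pure Python, not the authors' implementation; the function names, the dimensions, and the random weights are all hypothetical.

```python
import math
import random


def project(tokens, weight):
    """Multiply each token (list of floats) by a weight matrix (list of rows)."""
    return [
        [sum(tok[k] * weight[k][j] for k in range(len(tok)))
         for j in range(len(weight[0]))]
        for tok in tokens
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(appearance, motion, w_app, w_mot):
    """Single-head attention over the concatenation of two token streams.

    Each stream uses its own (hypothetical) Q/K/V projections, but the
    attention runs over all tokens together, so every token can attend to
    both streams. Returns the per-stream outputs and the attention weights.
    """
    q = project(appearance, w_app["q"]) + project(motion, w_mot["q"])
    k = project(appearance, w_app["k"]) + project(motion, w_mot["k"])
    v = project(appearance, w_app["v"]) + project(motion, w_mot["v"])
    scale = 1.0 / math.sqrt(len(q[0]))

    outputs, weights = [], []
    for query in q:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(attn[j] * v[j][d] for j in range(len(v)))
             for d in range(len(v[0]))]
        )

    n_app = len(appearance)
    return outputs[:n_app], outputs[n_app:], weights


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def random_projections(rng, dim):
    """Hypothetical stand-ins for learned Q/K/V weights."""
    return {name: random_matrix(rng, dim, dim) for name in ("q", "k", "v")}


rng = random.Random(0)
DIM = 4  # toy embedding size, chosen arbitrarily
appearance_tokens = random_matrix(rng, 3, DIM)  # e.g. 3 appearance tokens
motion_tokens = random_matrix(rng, 2, DIM)      # e.g. 2 motion tokens

app_out, mot_out, attn_weights = joint_attention(
    appearance_tokens,
    motion_tokens,
    random_projections(rng, DIM),
    random_projections(rng, DIM),
)
```

Here each of the five tokens receives an attention distribution over all five tokens, which is what lets motion tokens influence appearance tokens and vice versa. How HunyuanVideo-HOMA actually builds and conditions these streams is not specified in the abstract.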

arXiv information

Authors	Ziyao Huang, Zixiang Zhou, Juan Cao, Yifeng Ma, Yi Chen, Zejing Rao, Zhiyong Xu, Hongmei Wang, Qin Lin, Yuan Zhou, Qinglin Lu, Fan Tang
Published	2025-06-10 13:45:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
