InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

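Abstract

Text-conditioned human motion generation has advanced substantially thanks to diffusion models trained on extensive motion capture data and the corresponding text annotations.
Extending this success to 3D dynamic human-object interaction (HOI) generation, however, remains difficult, mainly because large-scale interaction data and comprehensive descriptions matched to those interactions are scarce.
This paper takes the initiative and shows that human-object interactions can be generated without training directly on paired text-interaction data.
The key insight is that the semantics and the dynamics of an interaction can be decoupled.
Because interaction semantics cannot be learned through supervised training, we instead leverage pre-trained large models, combining knowledge from a large language model and a text-to-motion model.
This knowledge provides high-level control over interaction semantics but cannot capture the intricacies of low-level interaction dynamics.
To overcome this, we additionally introduce a world model designed to understand simple physics, which models how human actions affect object motion.
By integrating these components, our new framework, InterDreamer, generates text-aligned 3D HOI sequences in a zero-shot manner.
We apply InterDreamer to the BEHAVE and CHAIRS datasets, and comprehensive experimental analysis demonstrates that it generates realistic, coherent interaction sequences that align closely with the text directives.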

Abstract (original)

Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level interaction dynamics. To overcome this issue, we further introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner. We apply InterDreamer to the BEHAVE and CHAIRS datasets, and our comprehensive experimental analysis demonstrates its capability to generate realistic and coherent interaction sequences that seamlessly align with the text directives.
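
The decoupling of semantics and dynamics described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `plan_human_motion`, `world_model_step`, and `generate_interaction` are hypothetical names, the semantic stage is a fixed stub standing in for the large language model and text-to-motion model, and the "world model" is a hand-written rule (the object copies the hand's displacement while in contact) rather than a learned dynamics model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ObjectState:
    position: Vec3


def plan_human_motion(text: str, num_frames: int) -> List[Vec3]:
    """Hypothetical semantic stage (stand-in for LLM + text-to-motion model).

    Assumption: the text is ignored here and a fixed hand trajectory is
    returned that rises 0.25 units per frame along z.
    """
    return [(0.0, 0.0, 1.0 + 0.25 * t) for t in range(num_frames)]


def world_model_step(obj: ObjectState, prev_hand: Vec3, hand: Vec3,
                     in_contact: bool) -> ObjectState:
    """Toy dynamics stage: while in contact, the object moves by the same
    displacement as the hand; otherwise it stays where it is."""
    if not in_contact:
        return obj
    delta = tuple(h - p for h, p in zip(hand, prev_hand))
    new_pos = tuple(o + d for o, d in zip(obj.position, delta))
    return ObjectState(position=new_pos)


def generate_interaction(text: str, num_frames: int, contact_start: int):
    """Run the semantic stage once, then roll the dynamics stage frame by frame."""
    hands = plan_human_motion(text, num_frames)
    obj = ObjectState(position=(0.5, 0.0, 0.0))  # assumed initial object pose
    objects = []
    prev_hand = hands[0]
    for t, hand in enumerate(hands):
        obj = world_model_step(obj, prev_hand, hand, in_contact=t >= contact_start)
        objects.append(obj)
        prev_hand = hand
    return hands, objects


hands, objects = generate_interaction("a person lifts a box", num_frames=5, contact_start=2)
```

The point of the sketch is only the separation of concerns: the semantic stage decides what the human does from the text, and the dynamics stage decides how the object responds, so neither needs paired text-interaction data.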

arXiv information

Authors	Sirui Xu, Ziyin Wang, Yu-Xiong Wang, Liang-Yan Gui
Published	2024-03-28 17:59:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
