Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

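Summary

Audio-driven human animation methods, such as talking-head and talking-body generation, have made remarkable progress in producing videos with synchronized facial movements and appealing visual quality.
However, existing methods focus mainly on animating a single person and struggle with multi-stream audio input, where audio is bound to the wrong person.
They also show limited instruction-following ability.
To address these problems, the paper proposes a new task, multi-person conversational video generation, and introduces MultiTalk, a new framework for the challenges of multi-person generation.
Specifically, for audio injection, the authors investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio-person binding problem.
They further observe that, during training, partial parameter training and multi-task training are crucial for preserving the base model's instruction-following ability.
MultiTalk outperforms other methods on several datasets, including talking-head, talking-body, and multi-person datasets, demonstrating the strong generative capability of the approach.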

Abstract (original)

Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and appealing visual quality videos. However, existing methods primarily focus on single human animation and struggle with multi-stream audio inputs, facing incorrect binding problems between audio and persons. Additionally, they exhibit limitations in instruction-following capabilities. To solve this problem, in this paper, we propose a novel task: Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges during multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio and person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
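The abstract names L-RoPE but does not describe how it works. As a rough illustration only, the sketch below applies standard rotary position embedding (RoPE) with a per-speaker label used in place of a sequence position. This is an assumption made for illustration, not the authors' implementation: the label values, the `rope_rotate` helper, and the toy feature vector are all hypothetical. It relies on one well-known property of RoPE, namely that the dot product of two rotated vectors depends only on the difference between their positions, so a query and key that share a label score higher than the same pair with different labels.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by an angle
    proportional to `position`, with a per-pair frequency base**(-i / dim).
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical labels: one value per speaker (not taken from the paper).
LABEL_PERSON_A = 2.0
LABEL_PERSON_B = 20.0

# Toy feature shared by the query and both keys, so only the label differs.
feature = [0.5, -1.0, 0.25, 0.75]

video_query_a = rope_rotate(feature, LABEL_PERSON_A)  # video tokens of person A
audio_key_a = rope_rotate(feature, LABEL_PERSON_A)    # audio stream of person A
audio_key_b = rope_rotate(feature, LABEL_PERSON_B)    # audio stream of person B

score_matched = dot(video_query_a, audio_key_a)
score_mismatched = dot(video_query_a, audio_key_b)

# Shifting both labels by the same amount leaves the score unchanged.
score_shifted = dot(
    rope_rotate(feature, LABEL_PERSON_A + 7.0),
    rope_rotate(feature, LABEL_PERSON_B + 7.0),
)
```

With identical features, the matched score equals the squared norm of the feature vector and the mismatched score is strictly smaller. How MultiTalk actually assigns labels and where the embedding is applied is not stated in the abstract.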

arXiv information

Authors	Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo
Published	2025-05-28 17:57:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
