From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

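Abstract

We present a framework for generating full-body, photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possible gestural motions for an individual, covering the face, body, and hands. The key to our method is combining the sample diversity offered by vector quantization with the high-frequency detail obtained through diffusion, which yields more dynamic and expressive motion. We visualize the generated motion with highly photorealistic avatars that can express important nuances in gestures (for example, sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that our model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (as opposed to meshes) for accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.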

Abstract (original)

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.

arXiv information

Authors	Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard
Published	2024-01-03 18:55:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
