Towards Variable and Coordinated Holistic Co-Speech Motion Generation

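Summary

This paper addresses the problem of generating lifelike holistic co-speech motion for 3D avatars, focusing on two key aspects: variability and coordination.
Variability lets the avatar show a wide range of motions even for similar speech content, while coordination keeps facial expressions, hand gestures, and body poses in harmony with one another.
The authors aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements during speech.
ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs.
First, product quantization (PQ) is introduced into the VAE to enrich the representation of complex holistic motion.
Second, a novel non-autoregressive model embeds 2D positional encoding into the product-quantized representation, thereby preserving the essential structural information of the PQ codes.
Finally, a second stage refines the preliminary prediction, further sharpening high-frequency details.
Combining these three designs enables ProbTalk to generate natural and diverse holistic co-speech motion, outperforming several state-of-the-art methods in qualitative and quantitative evaluations, particularly in terms of realism.
The code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/.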
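
To make the product quantization step more concrete, the following is a minimal sketch of generic PQ encoding and decoding, not the ProbTalk implementation. The codebook values, the latent vector, and the function names `pq_encode` and `pq_decode` are hypothetical and chosen only for illustration. The idea it shows is that a latent vector is split into groups, and each group is snapped to the nearest entry of its own small codebook.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Return one codeword index per sub-vector (hypothetical helper)."""
    num_groups = len(codebooks)
    sub_dim = len(vector) // num_groups
    codes = []
    for g, codebook in enumerate(codebooks):
        # Slice out the part of the latent that belongs to group g.
        sub = vector[g * sub_dim:(g + 1) * sub_dim]
        # Pick the nearest codeword in this group's codebook.
        best = min(range(len(codebook)),
                   key=lambda k: squared_distance(sub, codebook[k]))
        codes.append(best)
    return codes


def pq_decode(codes, codebooks):
    """Concatenate the selected codewords back into one vector."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out


# Toy setup (assumed values): 2 groups, 2 codewords each, sub-dimension 2.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook for group 0
    [[0.0, 1.0], [1.0, 0.0]],  # codebook for group 1
]
latent = [0.9, 0.8, 0.1, 0.7]

codes = pq_encode(latent, codebooks)
reconstruction = pq_decode(codes, codebooks)
print(codes, reconstruction)
```

With G groups of K codewords each, the combined representation can express K^G distinct latent vectors while storing only G × K codewords, which is one general reason PQ is used to enrich a quantized representation.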

Abstract (original)

This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars, focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content, while coordination ensures a harmonious alignment among facial expressions, hand gestures, and body poses. We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First, we introduce product quantization (PQ) to the VAE, which enriches the representation of complex holistic motion. Second, we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation, thereby preserving essential structure information of the PQ codes. Last, we employ a secondary stage to refine the preliminary prediction, further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions, outperforming several state-of-the-art methods in qualitative and quantitative evaluations, particularly in terms of realism. Our code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/.
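
As an illustration of what a 2D positional encoding over product-quantized codes could look like, here is a minimal sketch. It assumes the two axes are the time step and the PQ group index, and it uses standard sinusoidal encodings for each axis; the abstract does not specify either choice, so both are assumptions, as are the names `sinusoid` and `positional_encoding_2d`.

```python
import math


def sinusoid(position, dim):
    """Standard 1D sinusoidal encoding of length `dim` (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def positional_encoding_2d(num_frames, num_groups, dim):
    """Build a [num_frames][num_groups][dim] grid of encodings.

    Assumption: the first half of the channels encodes the time index
    and the second half encodes the PQ group index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [sinusoid(t, half) + sinusoid(g, half) for g in range(num_groups)]
        for t in range(num_frames)
    ]


# Toy grid (assumed sizes): 4 time steps, 3 PQ groups, 8 channels.
pe = positional_encoding_2d(num_frames=4, num_groups=3, dim=8)
print(len(pe), len(pe[0]), len(pe[0][0]))
```

In this sketch, two codes at the same time step share the first half of their encoding, and two codes in the same group share the second half, so a non-autoregressive model that sees all codes at once can still tell where each code sits in the grid.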

arXiv information

Authors	Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, Changxing Ding
Published	2024-04-15 11:18:00+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
