EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

Abstract

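Masked modeling frameworks have shown promise for co-speech motion generation.
However, they struggle to identify the semantically significant frames that effective motion masking depends on.
In this work, the authors propose a speech-queried, attention-based mask modeling framework for co-speech motion generation.
The key insight is to use motion-aligned speech features to guide the masked motion modeling process, selectively masking motion frames that are rhythm-related and semantically expressive.
Specifically, they first propose a motion-audio alignment module (MAM) that constructs a latent motion-audio joint space.
Both low-level and high-level speech features are projected into this space, and learnable speech queries are used to obtain a motion-aligned speech representation.
Next, a speech-queried attention mechanism (SQA) computes frame-level attention scores from the interaction between motion keys and speech queries, steering the selective masking toward motion frames with high attention scores.
Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation.
Qualitative and quantitative evaluations confirm that the method outperforms existing state-of-the-art approaches and successfully produces high-quality co-speech motion.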
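
The selective masking step can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how scores are aggregated across queries, what mask ratio is used, or how masked frames are represented. The sketch assumes scaled dot-product attention between speech queries and motion keys, a mean over queries as the per-frame score, and a hypothetical `MASK_TOKEN` id; the function names `frame_attention_scores` and `mask_top_frames` are likewise invented for illustration.

```python
import math
from typing import List, Sequence

MASK_TOKEN = -1  # hypothetical id standing in for a learned mask token


def frame_attention_scores(
    speech_queries: Sequence[Sequence[float]],
    motion_keys: Sequence[Sequence[float]],
) -> List[float]:
    """Score each motion frame by how strongly the speech queries attend to it.

    Assumption: each query yields a softmax over frames (scaled dot product),
    and the per-frame score is the mean attention weight across queries.
    """
    scale = 1.0 / math.sqrt(len(motion_keys[0]))
    scores = [0.0] * len(motion_keys)
    for query in speech_queries:
        logits = [
            scale * sum(a * b for a, b in zip(query, key)) for key in motion_keys
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for t, e in enumerate(exps):
            scores[t] += e / total / len(speech_queries)
    return scores


def mask_top_frames(
    tokens: Sequence[int], scores: Sequence[float], mask_ratio: float = 0.5
) -> List[int]:
    """Replace the highest-scoring frames with MASK_TOKEN (assumed top-k rule)."""
    num_masked = max(1, round(mask_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda t: scores[t], reverse=True)
    chosen = set(ranked[:num_masked])
    return [MASK_TOKEN if t in chosen else tok for t, tok in enumerate(tokens)]


# Toy example: one 2-D speech query, four motion frames.
queries = [[1.0, 0.0]]
keys = [[0.1, 0.0], [2.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
scores = frame_attention_scores(queries, keys)
print(mask_top_frames([10, 11, 12, 13], scores))  # [10, -1, 12, -1]
```

In the toy example, frames 1 and 3 align most strongly with the query, so they are the ones masked. In the framework described above, the queries and keys would come from the learned motion-audio joint space rather than from hand-written vectors.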

Abstract (original)

Masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.

arXiv information

Authors	Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
Published	2025-04-15 15:41:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
