Listen, denoise, action! Audio-driven motion synthesis with diffusion models

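Abstract

Diffusion models have attracted growing interest as probabilistic models that are highly expressive yet efficient to train.
We show that these models are well suited to synthesising human motion that co-occurs with audio, such as co-speech gestures, because the motion is complex and highly ambiguous given the audio, which calls for a probabilistic description.
Specifically, we adapt the DiffWave architecture to model 3D pose sequences, replacing its dilated convolutions with Conformers to improve accuracy.
We also show how to control motion style, using classifier-free guidance to adjust the strength of the stylistic expression.
Gesture-generation experiments on the Trinity Speech-Gesture and ZeroEGGS datasets confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced.
We also synthesise dance motion and path-driven locomotion using the same model architecture.
Finally, we extend the guidance procedure to perform style interpolation in a manner suited to synthesis tasks and connected to product-of-experts models.
Video examples are available at https://www.speech.kth.se/research/listen-denoise-action/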
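
The abstract names classifier-free guidance as the mechanism for adjusting style strength but gives no formula. As a minimal sketch, assuming the commonly used formulation in which the denoiser is evaluated once without and once with the style condition and the two noise predictions are combined linearly, the idea could look like the following. The function name `guided_noise`, the toy vectors, and the scale values are hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Sequence


def guided_noise(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Combine two noise predictions in the usual classifier-free guidance form.

    eps_uncond: prediction made without the style condition (assumed input).
    eps_cond:   prediction made with the style condition (assumed input).
    scale:      guidance strength; 0 ignores the style, 1 is plain
                conditional prediction, values above 1 exaggerate the style.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # Move from the unconditional prediction along the direction that the
    # style condition adds, scaled by the guidance strength.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy values standing in for one denoising step of a pose vector.
eps_uncond = [0.0, 1.0, -2.0]
eps_styled = [1.0, 3.0, -2.0]

for scale in (0.0, 1.0, 2.0):
    print(scale, guided_noise(eps_uncond, eps_styled, scale))
```

In a real sampler this combination would be applied at every denoising step, with the two predictions coming from the trained network rather than from fixed lists.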

Abstract (original)

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, for example co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved accuracy. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Gesture-generation experiments on the Trinity Speech-Gesture and ZeroEGGS datasets confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise dance motion and path-driven locomotion using the same model architecture. Finally, we extend the guidance procedure to perform style interpolation in a manner that is appealing for synthesis tasks and has connections to product-of-experts models, a contribution we believe is of independent interest. Video examples are available at https://www.speech.kth.se/research/listen-denoise-action/

arXiv information

Authors	Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter
Published	2022-11-17 17:41:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
