Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models


Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See the project page for video examples, data, and code.
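
As a rough illustration of the guidance idea mentioned in the abstract, the following Python sketch shows the standard classifier-free guidance combination of an unconditional and a style-conditioned noise prediction, plus one possible way to blend several styles. The function names, the array shapes, and the weighted-sum form of `interpolate_styles` are assumptions made for illustration; they are not taken from the authors' code, and the exact product-of-experts formulation in the paper may differ.

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, gamma):
    """Standard classifier-free guidance on noise predictions.

    gamma = 0 gives the unconditional prediction, gamma = 1 the
    style-conditioned one, and gamma > 1 exaggerates the style.
    """
    return eps_uncond + gamma * (eps_cond - eps_uncond)


def interpolate_styles(eps_uncond, eps_styles, weights):
    """Hypothetical multi-style blend (an assumption, not the paper's method).

    Each style contributes its guidance direction relative to the
    unconditional prediction, scaled by its own weight.
    """
    eps_styles = np.asarray(eps_styles, dtype=float)  # (K, frames, dims)
    weights = np.asarray(weights, dtype=float)        # (K,)
    directions = eps_styles - eps_uncond              # broadcast over K
    return eps_uncond + np.tensordot(weights, directions, axes=1)


# Toy pose-noise arrays: 4 frames, 3 pose dimensions (illustrative sizes).
eps_u = np.zeros((4, 3))
eps_a = np.ones((4, 3))
eps_b = -np.ones((4, 3))

stronger_style = guided_noise(eps_u, eps_a, gamma=2.0)
blended = interpolate_styles(eps_u, [eps_a, eps_b], [0.75, 0.25])
```

In an actual sampler, the combined prediction would replace the single-model noise estimate at every denoising step; with a single style and weight `gamma`, `interpolate_styles` reduces to `guided_noise`.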


Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter
Published: 2023-05-16 17:59:58+00:00

