Seamless Human Motion Composition with Blended Positional Encodings

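Summary

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics.
Prior work has focused on generating motion guided by text, music, or scenes, but the results are typically isolated motions limited to short durations.
This work instead tackles the generation of long, continuous sequences guided by a series of varying textual descriptions.
To that end, the authors introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps.
It relies on Blended Positional Encodings, a technique that uses both absolute and relative positional encodings in the denoising chain.
More specifically, global motion coherence is recovered at the absolute stage, while smooth and realistic transitions are built at the relative stage.
As a result, FlowMDM achieves state-of-the-art accuracy, realism, and smoothness on the Babel and HumanML3D datasets.
Thanks to its Pose-Centric Cross-ATtention, FlowMDM excels even when trained with only a single description per motion sequence, and it is robust to varying text descriptions at inference time.
Finally, to address the limitations of existing HMC metrics, the authors propose two new metrics for detecting abrupt transitions: the Peak Jerk and the Area Under the Jerk.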

Summary (original)

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.
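
The abstract names two jerk-based metrics but does not define them. As a rough illustration only, the sketch below assumes that jerk is the third time derivative of joint positions, approximates it with a third-order finite difference, takes the maximum magnitude over joints for each frame, and integrates the resulting curve with a rectangle rule. The function names, the input layout (frames × joints × 3 coordinates), and all of these choices are assumptions made here for illustration; the paper's actual definitions (normalization, joint selection, transition window) may differ.

```python
import math
from typing import List, Sequence

# Hypothetical input layout: frames[t][j] is the (x, y, z) position of joint j at frame t.
Frame = Sequence[Sequence[float]]


def jerk_magnitudes(frames: Sequence[Frame], fps: float) -> List[float]:
    """Per-frame maximum jerk magnitude over joints (assumed definition).

    Jerk is approximated by the third-order forward difference
    (x[t+3] - 3*x[t+2] + 3*x[t+1] - x[t]) / dt**3.
    """
    dt = 1.0 / fps
    curve: List[float] = []
    for t in range(len(frames) - 3):
        worst = 0.0
        for j in range(len(frames[t])):
            squared = 0.0
            for c in range(3):
                d3 = (
                    frames[t + 3][j][c]
                    - 3.0 * frames[t + 2][j][c]
                    + 3.0 * frames[t + 1][j][c]
                    - frames[t][j][c]
                ) / dt ** 3
                squared += d3 * d3
            worst = max(worst, math.sqrt(squared))
        curve.append(worst)
    return curve


def peak_jerk(curve: Sequence[float]) -> float:
    """Largest value of the jerk curve; 0.0 for an empty curve."""
    return max(curve) if curve else 0.0


def area_under_jerk(curve: Sequence[float], fps: float) -> float:
    """Rectangle-rule integral of the jerk curve over time."""
    return sum(curve) / fps
```

Under these assumptions, a motion with constant velocity yields a jerk curve of zeros, while an abrupt jump between two poses produces a sharp spike that raises both values, which is the kind of discontinuity the abstract says the metrics are meant to detect.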

arXiv information

Authors	German Barquero, Sergio Escalera, Cristina Palmero
Published	2024-02-23 18:59:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
