Disentangled Diffusion-Based 3D Human Pose Estimation with Hierarchical Spatial and Temporal Denoiser

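Summary

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing 3D joint coordinates from 2D pose sequences.
Some methods decompose the task into bone length and bone direction prediction, based on the human anatomical skeleton, to explicitly incorporate more human body prior constraints, but they perform significantly worse than the SOTA diffusion-based methods.
This can be attributed to the tree structure of the human skeleton.
Applying the disentangled approach directly can amplify the accumulation of hierarchical errors, which propagate through each level of the hierarchy.
Meanwhile, previous methods have not fully explored the hierarchical information.
To address these problems, a disentangled diffusion-based 3D human pose estimation method with a Hierarchical Spatial and Temporal Denoiser, termed DDHPose, is proposed.
In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior.
A disentanglement loss is proposed to supervise diffusion model learning.
(2) For the reverse process, we propose the Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint.
HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT).
HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT uses information from both the joint and its hierarchically adjacent joints to explore the hierarchical temporal correlations among joints.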
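
The bone length and bone direction decomposition, and the way a single bone error reaches every descendant joint in the skeleton tree, can be illustrated on a toy skeleton. The following is a minimal sketch, not the authors' implementation: the five-joint `PARENTS` table, the function names `decompose` and `compose`, and the sample coordinates are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 5-joint skeleton (an assumption, not the paper's joint layout).
# PARENTS[j] is the parent of joint j; -1 marks the root.
# Parents always appear before their children in this ordering.
PARENTS = [-1, 0, 1, 2, 1]


def decompose(pose, parents=PARENTS):
    """Split joint positions (J, 3) into bone lengths (J,) and unit directions (J, 3)."""
    lengths = np.zeros(len(parents))
    directions = np.zeros((len(parents), 3))
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root has no incoming bone
        bone = pose[j] - pose[p]
        lengths[j] = np.linalg.norm(bone)
        directions[j] = bone / lengths[j]
    return lengths, directions


def compose(root, lengths, directions, parents=PARENTS):
    """Rebuild joint positions by walking the tree from the root outward."""
    pose = np.zeros((len(parents), 3))
    for j, p in enumerate(parents):
        pose[j] = root if p < 0 else pose[p] + lengths[j] * directions[j]
    return pose


# Toy pose: a vertical chain 0-1-2-3 with a side branch 1-4.
pose = np.array(
    [[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0], [1, 1, 0]], dtype=float
)
lengths, directions = decompose(pose)
rebuilt = compose(pose[0], lengths, directions)  # exact round trip

# Corrupt only the direction of the bone that ends at joint 1.
noisy_directions = directions.copy()
noisy_directions[1] = [1.0, 0.0, 0.0]
drifted = compose(pose[0], lengths, noisy_directions)

# Per-joint position error caused by that single wrong direction.
errors = np.linalg.norm(drifted - pose, axis=1)
print(errors)
```

In this toy case one wrong direction on the bone into joint 1 displaces joints 1 through 4 by the same amount, while every bone length is preserved. When several bones along a path are wrong, their displacements add up further down the tree, which is the hierarchical error accumulation the summary refers to.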

Summary (original)

Recently, diffusion-based methods for monocular 3D human pose estimation have achieved state-of-the-art (SOTA) performance by directly regressing the 3D joint coordinates from the 2D pose sequence. Although some methods decompose the task into bone length and bone direction prediction based on the human anatomical skeleton to explicitly incorporate more human body prior constraints, the performance of these methods is significantly lower than that of the SOTA diffusion-based methods. This can be attributed to the tree structure of the human skeleton. Direct application of the disentangled method could amplify the accumulation of hierarchical errors, propagating through each hierarchy. Meanwhile, the hierarchical information has not been fully explored by the previous methods. To address these problems, a Disentangled Diffusion-based 3D Human Pose Estimation method with Hierarchical Spatial and Temporal Denoiser is proposed, termed DDHPose. In our approach: (1) We disentangle the 3D pose and diffuse the bone length and bone direction during the forward process of the diffusion model to effectively model the human pose prior. A disentanglement loss is proposed to supervise diffusion model learning. (2) For the reverse process, we propose Hierarchical Spatial and Temporal Denoiser (HSTDenoiser) to improve the hierarchical modeling of each joint. Our HSTDenoiser comprises two components: the Hierarchical-Related Spatial Transformer (HRST) and the Hierarchical-Related Temporal Transformer (HRTT). HRST exploits joint spatial information and the influence of the parent joint on each joint for spatial modeling, while HRTT utilizes information from both the joint and its hierarchical adjacent joints to explore the hierarchical temporal correlations among joints.
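
The forward process in item (1) can be sketched with the standard DDPM closed form, x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, applied separately to bone lengths and bone directions. This is a hedged illustration only: the abstract does not give the noise schedule, the number of steps, or the tensor layout, so the linear beta schedule, `T = 1000`, the `(frames, bones, channels)` shapes, the random placeholder data, and the helper name `forward_diffuse` are all assumptions.

```python
import numpy as np


def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


# Assumed schedule: linear betas over T steps (not taken from the paper).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)

# Assumed layout: F frames, B bones. Random placeholders stand in for
# ground-truth bone lengths and unit bone directions.
F, B = 4, 16
bone_length = np.abs(rng.standard_normal((F, B, 1)))
bone_dir = rng.standard_normal((F, B, 3))
bone_dir /= np.linalg.norm(bone_dir, axis=-1, keepdims=True)

t = 500  # an arbitrary intermediate timestep
noisy_length, eps_length = forward_diffuse(bone_length, t, alphas_cumprod, rng)
noisy_dir, eps_dir = forward_diffuse(bone_dir, t, alphas_cumprod, rng)

# Note: after adding Gaussian noise the directions are no longer unit vectors.
# Whether and where they are renormalized is not stated in the abstract.
print(noisy_length.shape, noisy_dir.shape)
```

A denoiser trained on such samples would take the noised bone lengths and directions together with the 2D pose sequence and the timestep as input; that part is not shown here because the abstract does not specify it in enough detail.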

arXiv information

Authors	Qingyuan Cai, Xuecai Hu, Saihui Hou, Li Yao, Yongzhen Huang
Published	2024-03-07 12:20:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
