Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture

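Summary

Recovering absolute poses in the world coordinate system from a monocular view is a significant challenge.
Two main problems arise in this setting.
First, existing methods rely on 3D motion data for training, and that data has to be collected in constrained environments.
Obtaining such 3D labels for new actions in a timely manner is impractical, which severely limits a model's ability to generalize.
2D poses, by contrast, are far more accessible and easier to obtain.
Second, estimating a person's absolute position in metric space from a single viewpoint is inherently more complex.
To address these challenges, we introduce Mocap-2-to-3, a new framework that decomposes complex 3D motion into 2D poses, leverages 2D data to improve 3D motion reconstruction across diverse scenarios, and accurately predicts absolute positions in the world coordinate system.
We first pretrain a single-view diffusion model on extensive 2D data, then fine-tune a multi-view diffusion model for view consistency using publicly available 3D data.
This strategy makes effective use of large-scale 2D data.
We also propose a human motion representation that decouples local actions from global movement and encodes geometric priors of the ground, so that the generative model learns accurate motion priors from 2D data.
At inference time, this allows global movement to be recovered gradually, which yields more plausible positioning.
We evaluate the model on real-world datasets and show higher accuracy in both motion and absolute human positioning than state-of-the-art methods, along with better generalization and scalability.
Our code will be made publicly available.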

Abstract (original)

Recovering absolute poses in the world coordinate system from monocular views presents significant challenges. Two primary issues arise in this context. Firstly, existing methods rely on 3D motion data for training, which requires collection in limited environments. Acquiring such 3D labels for new actions in a timely manner is impractical, severely restricting the model’s generalization capabilities. In contrast, 2D poses are far more accessible and easier to obtain. Secondly, estimating a person’s absolute position in metric space from a single viewpoint is inherently more complex. To address these challenges, we introduce Mocap-2-to-3, a novel framework that decomposes intricate 3D motions into 2D poses, leveraging 2D data to enhance 3D motion reconstruction in diverse scenarios and accurately predict absolute positions in the world coordinate system. We initially pretrain a single-view diffusion model with extensive 2D data, followed by fine-tuning a multi-view diffusion model for view consistency using publicly available 3D data. This strategy facilitates the effective use of large-scale 2D data. Additionally, we propose an innovative human motion representation that decouples local actions from global movements and encodes geometric priors of the ground, ensuring the generative model learns accurate motion priors from 2D data. During inference, this allows for the gradual recovery of global movements, resulting in more plausible positioning. We evaluate our model’s performance on real-world datasets, demonstrating superior accuracy in motion and absolute human positioning compared to state-of-the-art methods, along with enhanced generalization and scalability. Our code will be made publicly available.
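
The abstract does not spell out the motion representation, so the following is only a minimal sketch of the general idea of decoupling local actions from global movement. It assumes a plain root-relative decomposition: the root joint's world position is treated as the global trajectory, and every joint is re-expressed relative to that root. The function names, the `root_index` parameter, and the data layout are hypothetical and are not taken from the paper; the ground geometric priors mentioned in the abstract are not modeled here.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Frame = List[Point3]  # one 3D world-space position per joint


def decouple_motion(frames: List[Frame], root_index: int = 0):
    """Split world-space joints into a global root trajectory and local poses.

    Assumption (not from the paper): "global movement" is the root joint's
    world position per frame, and "local action" is each joint's offset
    from that root.
    """
    trajectory: List[Point3] = []
    local_poses: List[Frame] = []
    for joints in frames:
        rx, ry, rz = joints[root_index]
        trajectory.append((rx, ry, rz))
        # Express every joint relative to the root of the same frame.
        local_poses.append([(x - rx, y - ry, z - rz) for (x, y, z) in joints])
    return trajectory, local_poses


def recompose_motion(trajectory: List[Point3], local_poses: List[Frame]) -> List[Frame]:
    """Inverse of decouple_motion: add the root trajectory back to each pose."""
    return [
        [(x + rx, y + ry, z + rz) for (x, y, z) in pose]
        for (rx, ry, rz), pose in zip(trajectory, local_poses)
    ]
```

With this split, two frames showing the same body pose at different places in the world share an identical local pose and differ only in the trajectory, which is the property that lets local motion and global position be handled separately.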

arXiv information

Authors	Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou
Published	2025-03-06 14:32:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
