XFormer: Fast and Accurate Monocular 3D Body Capture

Summary

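We introduce XFormer, a new human mesh and motion capture method that achieves real-time performance on consumer CPUs using only monocular images as input.
The proposed network architecture has two branches: a keypoint branch that estimates 3D human mesh vertices from 2D keypoints, and an image branch that makes predictions directly from RGB image features.
At the core of the method is a cross-modal transformer block that lets information flow between the two branches by modeling attention between 2D keypoint coordinates and image spatial features.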
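The sketch below shows the general idea of cross-modal attention between keypoint tokens and image feature tokens. It is not the XFormer implementation, and it rests on several assumptions:

- single-head scaled dot-product attention with no learned query/key/value projections;
- 2D keypoint coordinates that have already been embedded into tokens with the same channel count as the image features;
- attention running in both directions, with residual connections.

The names `cross_attention`, `add_residual` and `cross_modal_block` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # a sequence of tokens, each a d-dimensional vector


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention in which every query token
    attends over all key/value tokens of the other modality."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out: Matrix = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Output token = attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add_residual(x: Matrix, y: Matrix) -> Matrix:
    """Residual connection: element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def cross_modal_block(kp_tokens: Matrix, img_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical bidirectional information exchange between the two branches."""
    # Keypoint tokens query the image feature cells ...
    kp_out = add_residual(kp_tokens, cross_attention(kp_tokens, img_tokens, img_tokens))
    # ... and image feature cells query the keypoint tokens.
    img_out = add_residual(img_tokens, cross_attention(img_tokens, kp_tokens, kp_tokens))
    return kp_out, img_out


# Toy usage: 2 keypoint tokens and 3 image cells, all with d = 4 channels.
kp = [[0.1, 0.2, 0.0, 1.0], [0.5, 0.4, 1.0, 0.0]]
img = [[0.3, 0.1, 0.2, 0.0], [0.0, 0.9, 0.1, 0.5], [0.7, 0.2, 0.4, 0.1]]
kp_out, img_out = cross_modal_block(kp, img)
print(len(kp_out), len(kp_out[0]), len(img_out), len(img_out[0]))  # 2 4 3 4
```

Each query mixes the other modality's value vectors with softmax weights. As a result, every keypoint token carries a weighted summary of image evidence, and every image cell carries a summary of keypoint evidence. A typical transformer block would also add learned projections, multiple heads and a feed-forward layer.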
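The architecture is designed so that it can be trained on several types of datasets: images with 2D/3D annotations, images with 3D pseudo-labels, and motion capture datasets that have no associated images.
This improves both the accuracy and the generalization ability of the system.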
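The abstract does not say how a single model is trained on datasets with different kinds of labels. One common approach, assumed here rather than confirmed by the source, has two parts:

- compute each loss term only when its target exists for a sample;
- down-weight 3D pseudo-labels relative to ground truth.

`Sample`, `masked_loss`, the `preds` dictionary keys and the `pseudo_weight` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """Hypothetical training record; a field is None when its label is absent."""
    has_image: bool                  # False for motion-capture-only data
    kp2d: Optional[List[float]]      # 2D keypoint annotation (flattened)
    verts3d: Optional[List[float]]   # 3D mesh target (flattened)
    is_pseudo: bool = False          # True if verts3d is a 3D pseudo-label


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two flattened vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def masked_loss(sample: Sample, preds: Dict[str, List[float]],
                pseudo_weight: float = 0.5) -> float:
    """Add up only the loss terms whose targets exist for this sample."""
    total = 0.0
    if sample.verts3d is not None:
        # Pseudo-labels are down-weighted (the 0.5 default is arbitrary).
        w = pseudo_weight if sample.is_pseudo else 1.0
        # The keypoint branch needs no image, so mocap data can supervise it
        # (assuming its 2D inputs are obtained by projecting the 3D body).
        total += w * mse(preds["kp_branch_verts"], sample.verts3d)
        if sample.has_image:
            total += w * mse(preds["img_branch_verts"], sample.verts3d)
    if sample.has_image and sample.kp2d is not None:
        # 2D term, e.g. a reprojection loss on projected model joints.
        total += mse(preds["proj_kp2d"], sample.kp2d)
    return total


# Mocap sample: no image and no 2D label, so only the keypoint-branch term counts.
mocap = Sample(has_image=False, kp2d=None, verts3d=[1.0, 0.0])
print(masked_loss(mocap, {"kp_branch_verts": [1.0, 2.0]}))  # 2.0
```

Masking terms this way lets every dataset type contribute through whichever branch it can supervise. No sample is forced to provide labels it does not have.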
Built on a lightweight backbone (MobileNetV3), the method runs very fast (over 30 fps on a single CPU core) while still achieving competitive accuracy.
Furthermore, XFormer with an HRNet backbone achieves state-of-the-art performance on the Human3.6M and 3DPW datasets.

Summary (original)

We present XFormer, a novel human mesh and motion capture method that achieves real-time performance on consumer CPUs given only monocular images as input. The proposed network architecture contains two branches: a keypoint branch that estimates 3D human mesh vertices given 2D keypoints, and an image branch that makes predictions directly from the RGB image features. At the core of our method is a cross-modal transformer block that allows information to flow across these two branches by modeling the attention between 2D keypoint coordinates and image spatial features. Our architecture is smartly designed, which enables us to train on various types of datasets including images with 2D/3D annotations, images with 3D pseudo labels, and motion capture datasets that do not have associated images. This effectively improves the accuracy and generalization ability of our system. Built on a lightweight backbone (MobileNetV3), our method runs blazing fast (over 30fps on a single CPU core) and still yields competitive accuracy. Furthermore, with an HRNet backbone, XFormer delivers state-of-the-art performance on the Human3.6M and 3DPW datasets.

arXiv information

Authors	Lihui Qian, Xintong Han, Faqiang Wang, Hongyu Liu, Haoye Dong, Zhiwen Li, Huawei Wei, Zhe Lin, Cheng-Bin Jin
Published	2023-05-18 16:45:26+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
