Cross-view and Cross-pose Completion for 3D Human Understanding

Summary

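Human perception and understanding is a major area of computer vision and, like many other vision subfields in recent years, stands to benefit from large models pre-trained on large datasets.
We hypothesize that the most common pre-training strategy, which relies on general-purpose, object-centric image datasets such as ImageNet, is limited by a significant domain shift.
On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well.
We therefore propose a pre-training approach based on self-supervised learning that operates on human-centric data using only images.
Our method uses pairs of images of humans: the first image is partially masked, and the model is trained to reconstruct the masked parts given the visible parts and the second image.
To learn priors about 3D and about human motion, it relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos.
We pre-train one model for body-centric tasks and one for hand-centric tasks.
Using a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide range of human-centric downstream tasks and achieve state-of-the-art performance, for instance when fine-tuned for model-based and model-free human mesh recovery.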

Summary (original)

Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs, and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery.
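The pair-based masking setup described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch size of 16, the mask ratio of 0.75, and the mean-squared-error loss are all assumptions chosen for illustration, and the transformer that would predict the masked patches is left out.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def make_completion_sample(first, second, patch=16, mask_ratio=0.75, rng=None):
    """Build one training sample from an image pair (hypothetical helper).

    Only the first image is masked; the second image (another view or another
    pose of the same person) is passed through unmasked as a reference.
    """
    rng = np.random.default_rng() if rng is None else rng
    first_patches = patchify(first, patch)
    second_patches = patchify(second, patch)

    n = first_patches.shape[0]
    n_masked = int(round(mask_ratio * n))  # assumed ratio, not from the source
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])

    return {
        "visible_patches": first_patches[visible_idx],   # model input, image 1
        "visible_idx": visible_idx,
        "masked_idx": masked_idx,
        "target_patches": first_patches[masked_idx],     # reconstruction target
        "reference_patches": second_patches,             # model input, image 2
    }


def reconstruction_loss(pred, target):
    """Mean squared error over masked patches (an assumed choice of loss)."""
    return float(np.mean((pred - target) ** 2))
```

A model trained this way would take `visible_patches` and `reference_patches` as input and be penalized with `reconstruction_loss` on its predictions for the patches at `masked_idx`. The same sample format would apply to both cross-view pairs and cross-pose pairs, since only the choice of the second image differs.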

arXiv information

Authors	Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez
Published	2023-11-15 16:51:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
