MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo

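Summary

Feature representation learning is a key ingredient of learning-based Multi-View Stereo (MVS). The vanilla Feature Pyramid Network (FPN), the common feature extractor in learning-based MVS, produces poor feature representations in reflective and texture-less regions, which limits how well MVS generalizes. Even FPNs combined with pre-trained Convolutional Neural Networks (CNNs) fail to address these issues. Vision Transformers (ViTs), on the other hand, have achieved notable success in many 2D vision tasks, so we ask whether ViTs can facilitate feature learning in MVS. In this paper, we propose MVSFormer, an MVS network enhanced by a pre-trained ViT, which learns more reliable feature representations by drawing on informative priors from the ViT. We further propose two variants: MVSFormer-P, with fixed ViT weights, and MVSFormer-H, with trainable ones. MVSFormer-P is more efficient, while MVSFormer-H achieves better performance. To make ViTs robust to arbitrary resolutions in MVS tasks, we propose efficient multi-scale training with gradient accumulation. We also discuss the merits and drawbacks of classification-based and regression-based MVS methods and propose unifying them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. In particular, our anonymous MVSFormer submission ranked first on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard on the day of submission, compared with other published works. Code and models will be released.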

Summary (original)

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPN) suffers from discouraged feature representations for reflection and texture-less areas, which limits the generalization of MVS. Even FPNs worked with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. Thus we ask whether ViTs can facilitate the feature learning in MVS? In this paper, we propose a pre-trained ViT enhanced MVS network called MVSFormer, which can learn more reliable feature representations benefited by informative priors from ViT. Then MVSFormer-P and MVSFormer-H are further proposed with fixed ViT weights and trainable ones respectively. MVSFormer-P is more efficient while MVSFormer-H can achieve superior performance. To make ViTs robust to arbitrary resolutions for MVS tasks, we propose to use an efficient multi-scale training with gradient accumulation. Moreover, we discuss the merits and drawbacks of classification and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. Particularly, our anonymous submission of MVSFormer is ranked in the Top-1 position on both intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard on the day of submission compared with other published works. Codes and models will be released.
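The abstract does not spell out how the temperature-based strategy works, so the following is only a minimal sketch of one way a temperature could bridge the two families of methods. It assumes a per-pixel vector of scores over discrete depth hypotheses: a classification-style method picks the single best hypothesis (argmax), a regression-style method takes the probability-weighted expectation, and a softmax temperature moves between the two. The function names, the toy scores, and the depth values are all hypothetical and are not taken from the MVSFormer code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of scores, scaled by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, depth_hypotheses, temperature=1.0):
    """Regression-style depth: expectation of the hypotheses under the softmax."""
    probs = softmax(logits, temperature)
    return sum(p * d for p, d in zip(probs, depth_hypotheses))


def argmax_depth(logits, depth_hypotheses):
    """Classification-style depth: the hypothesis with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return depth_hypotheses[best]


# Toy per-pixel scores over four hypothetical depth planes (illustrative values only).
depths = [1.0, 2.0, 3.0, 4.0]
logits = [0.0, 2.0, 1.0, 0.0]

soft = expected_depth(logits, depths, temperature=1.0)    # smooth, regression-like
sharp = expected_depth(logits, depths, temperature=0.01)  # approaches the argmax
hard = argmax_depth(logits, depths)                       # classification-like
```

In this toy setting, lowering the temperature concentrates the probability mass on the best-scoring hypothesis, so the expectation approaches the argmax result, while a temperature of 1 blends neighbouring hypotheses into a value between the planes. Whether MVSFormer applies the temperature in this way, and at which stage, is not stated in the abstract.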

arXiv information

Authors	Chenjie Cao, Xinlin Ren, Yanwei Fu
Published	2022-08-04 09:17:30+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
