MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth

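Summary

Feature representation learning is a key ingredient of learning-based Multi-View Stereo (MVS). The vanilla Feature Pyramid Network (FPN), the common feature extractor in learning-based MVS, produces weak feature representations for reflective and texture-less regions, which limits how well MVS generalizes. Even FPNs combined with pre-trained Convolutional Neural Networks (CNNs) fail to resolve these problems. Vision Transformers (ViTs), by contrast, have achieved notable success in many 2D vision tasks, so we ask whether ViTs can improve feature learning in MVS. In this paper we propose MVSFormer, an MVS network enhanced with a pre-trained ViT, which can learn more reliable feature representations from the informative priors of the ViT. We further propose two variants: MVSFormer-P, which freezes the ViT weights, and MVSFormer-H, which keeps them trainable. MVSFormer-P is more efficient, while MVSFormer-H achieves better performance. Efficient multi-scale training, strengthened by gradient accumulation, lets MVSFormer generalize to various input resolutions. We also discuss the merits and drawbacks of classification-based and regression-based MVS methods and propose unifying them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. In particular, on the day of submission, our anonymous MVSFormer submission ranked first among published works on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard. Code and models will be released soon.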
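As a rough illustration of the gradient accumulation mentioned above, the sketch below sums scaled gradients from several small micro-batches and applies a single update, which matches the full-batch gradient when the micro-batches have equal size. The toy linear model, the `mse_grad` helper, and all sizes are assumptions made for illustration; this is not the MVSFormer training code.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)

# Hypothetical toy data: 8 samples, 3 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Reference: gradient computed on the full batch at once.
full_grad = mse_grad(w, x, y)

# Gradient accumulation: process 4 micro-batches of 2 samples each,
# scale each gradient by 1 / accum_steps, and sum before updating.
accum_steps = 4
accumulated = np.zeros_like(w)
for xb, yb in zip(np.split(x, accum_steps), np.split(y, accum_steps)):
    accumulated += mse_grad(w, xb, yb) / accum_steps

# One optimizer step using the accumulated gradient.
lr = 0.1
w_new = w - lr * accumulated
```

Because only one micro-batch is held in memory at a time, the same effective batch size can be reached with a smaller memory footprint, which is presumably why it pairs well with training on several input resolutions.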

Abstract (original)

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPN) suffers from discouraged feature representations for reflection and texture-less areas, which limits the generalization of MVS. Even FPNs worked with pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. Thus we ask whether ViTs can facilitate feature learning in MVS? In this paper, we propose a pre-trained ViT enhanced MVS network called MVSFormer, which can learn more reliable feature representations benefited by informative priors from ViT. Then MVSFormer-P and MVSFormer-H are further proposed with freezed ViT weights and trainable ones respectively. MVSFormer-P is more efficient while MVSFormer-H can achieve superior performance. MVSFormer can be generalized to various input resolutions with the efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. Particularly, our anonymous submission of MVSFormer is ranked in the Top-1 position on both intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard on the day of submission compared with other published works. Codes and models will be released soon.
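One way to read the temperature-based unification of classification and regression is as a temperature-scaled softmax over per-pixel depth hypotheses. A low temperature approaches picking the single best hypothesis (classification-like), while a temperature of 1 gives the usual expected depth (regression-like). The abstract does not give the formula, so the function below, its name, and the example values are assumptions for illustration only.

```python
import numpy as np

def temperature_depth(logits, depth_hypotheses, temperature=1.0):
    """Expected depth under a temperature-scaled softmax over hypotheses."""
    logits = np.asarray(logits, dtype=np.float64)
    depths = np.asarray(depth_hypotheses, dtype=np.float64)
    scaled = logits / temperature
    # Subtract the maximum for numerical stability before exponentiating.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    probs = np.exp(scaled)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return (probs * depths).sum(axis=-1)

# Hypothetical single pixel with two competing peaks (at depths 2 and 4).
logits = np.array([0.1, 2.0, 0.2, 1.8])
depths = np.array([1.0, 2.0, 3.0, 4.0])

# temperature = 1: plain expectation, pulled between the two peaks.
soft = temperature_depth(logits, depths, temperature=1.0)

# Small temperature: nearly one-hot, close to the argmax hypothesis.
sharp = temperature_depth(logits, depths, temperature=0.01)
```

In this toy case the expectation at temperature 1 lands between the two peaks, at a depth neither peak supports, whereas the low-temperature estimate snaps to the strongest hypothesis. This is the kind of trade-off between the two families of methods that the abstract alludes to.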

arXiv information

Authors	Chenjie Cao, Xinlin Ren, Yanwei Fu
Published	2022-08-08 16:49:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
