Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

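Summary

Neural radiance fields (NeRF) have made impressive progress in novel view synthesis, but most methods require multiple input images of the same scene with accurate camera poses.
This work aims to reduce the input to a single unposed image.
Existing approaches condition on local image features to reconstruct a 3D object, but they often render blurry predictions at viewpoints far from the source view.
To address this, the authors propose combining global and local features to form an expressive 3D representation.
The global features are learned by a vision transformer, and the local features are extracted by a 2D convolutional network.
To synthesize a novel view, a multilayer perceptron (MLP) conditioned on the learned 3D representation is trained to perform volume rendering.
This 3D representation lets the network reconstruct unseen regions without enforcing constraints such as symmetry or a canonical coordinate system.
The method renders novel views from only a single input image and generalizes across multiple object categories with a single model.
Quantitative and qualitative evaluations show that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.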

Abstract (original)

Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.
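
The abstract mentions volume rendering without giving a formula. As a minimal sketch, assuming the standard NeRF quadrature (this page does not confirm it is the authors' exact implementation), the following shows how per-sample densities and colors along one ray could be composited into a pixel color. The function name `composite_ray` and its parameters are hypothetical; in the described method, the densities and colors would come from the MLP conditioned on the global and local features.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Composite samples along one ray (standard NeRF quadrature, assumed here).

    sigmas: (N,) non-negative densities at each sample
    colors: (N, 3) RGB color at each sample
    deltas: (N,) distance between adjacent samples
    Returns the composited RGB color and the per-sample weights.
    """
    sigmas = np.asarray(sigmas, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)

    # Opacity contributed by each sample interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    # Weighted sum of sample colors gives the rendered pixel color.
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

The weights sum to at most 1; a ray passing through empty space (all densities zero) composites to black under this sketch, since no background color is modeled.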

arXiv information

Authors	Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, Ravi Ramamoorthi
Published	2022-07-12 17:52:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
