EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Summary

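The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent VPR studies have highlighted the significant advantage of adopting pre-trained foundation models such as DINOv2 for the VPR task. However, these models are often regarded as insufficient for VPR unless they are further fine-tuned on VPR-specific data. In this paper, we present an effective approach that harnesses the potential of foundation models for VPR. We show that features extracted from self-attention layers serve as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only surpasses previous zero-shot methods but also yields results comparable to several supervised methods. Next, we show that a single-stage approach that uses internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive compactness down to 128 dimensions. Integrating our local features for re-ranking widens this performance gap further. The method also shows outstanding robustness and generality, handling challenging conditions such as occlusion, day-night transitions, and seasonal variations, and sets a new state of the art.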

Summary (original)

The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also introduces results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance, while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.
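The abstract describes a two-stage retrieval pattern: a compact global descriptor shortlists candidate database images, and local features then re-rank that shortlist. The sketch below illustrates only that generic pattern and is not the authors' implementation. It assumes descriptors have already been extracted (in the paper they come from DINOv2 ViT layers; here they are toy 2-D vectors), and it scores re-ranking by counting mutual nearest-neighbour matches between local features, which is a common choice assumed here rather than taken from the paper. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def global_shortlist(query_global, db_globals, k):
    """Stage 1: rank database images by global-descriptor similarity, keep the top-k indices."""
    # sorted() is stable, so equal scores keep database order.
    scored = sorted(enumerate(db_globals), key=lambda item: -cosine(query_global, item[1]))
    return [i for i, _ in scored[:k]]


def mutual_nn_count(q_locals, d_locals):
    """Count local-feature pairs that are each other's nearest neighbour (assumed re-rank score)."""
    if not q_locals or not d_locals:
        return 0
    # For each query feature, index of its most similar database feature, and vice versa.
    q_to_d = [max(range(len(d_locals)), key=lambda j: cosine(q, d_locals[j])) for q in q_locals]
    d_to_q = [max(range(len(q_locals)), key=lambda i: cosine(q_locals[i], d)) for d in d_locals]
    return sum(1 for i, j in enumerate(q_to_d) if d_to_q[j] == i)


def retrieve(query, database, k=2):
    """Stage 2: re-rank the global shortlist by local matches (ties keep shortlist order)."""
    shortlist = global_shortlist(query["global"], [d["global"] for d in database], k)
    return sorted(shortlist, key=lambda i: -mutual_nn_count(query["local"], database[i]["local"]))


# Toy "descriptors" standing in for real ViT features (hypothetical data).
database = [
    {"global": [1.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.9, 0.1], "local": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]},
    {"global": [0.0, 1.0], "local": [[-1.0, 0.0]]},
]
query = {"global": [1.0, 0.05], "local": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]}

print(global_shortlist(query["global"], [d["global"] for d in database], k=2))  # → [0, 1]
print(retrieve(query, database, k=2))  # → [1, 0]
```

Here the global stage ranks image 0 first, but image 1 shares more mutually matched local features with the query, so re-ranking promotes it. Exhaustive pairwise matching grows with the product of the local-feature counts, which is why re-ranking is usually restricted to a short list.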

arXiv information

Authors	Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari
Publication date	2025-02-02 22:46:41+00:00
arXiv site	arxiv_id(pdf)

Provider, services used

arxiv.jp, DeepL
