Learning Navigational Visual Representations with Semantic Map Supervision

Summary

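Visual navigation for a household robot depends on perceiving both the semantics and the spatial structure of the environment.
Most existing work, however, uses visual backbones pre-trained either on independent images for classification or with self-supervised methods adapted to the indoor navigation domain, and so neglects the spatial relationships that learning to navigate requires.
Inspired by the way humans naturally build semantically and spatially meaningful cognitive maps while navigating, the paper proposes a navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps (Ego$^2$-Map).
A visual transformer serves as the backbone encoder, and the model is trained on data collected from the large-scale Habitat-Matterport3D environments.
Ego$^2$-Map learning transfers the compact, rich information in a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
In experiments, agents that use the learned representations for object-goal navigation outperform recent visual pre-training methods.
The representations also significantly improve vision-and-language navigation in continuous environments, for both high-level and low-level action spaces, reaching new state-of-the-art results of 47% SR (success rate) and 41% SPL (success weighted by path length) on the test server.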
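The idea of contrasting egocentric views with semantic maps can be illustrated with a small sketch. The abstract does not state the exact loss, so the symmetric InfoNCE objective below is an assumption, as are the function names `l2_normalize` and `info_nce`, the `temperature` value, and the use of plain Python lists in place of encoder outputs. It is meant only to show how paired view and map embeddings might be pulled together while mismatched pairs are pushed apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(view_feats, map_feats, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed objective, not confirmed by the abstract).

    view_feats[i] and map_feats[i] are embeddings of the same location:
    an egocentric view and its semantic map. All other pairings in the
    batch act as negatives.
    """
    views = [l2_normalize(v) for v in view_feats]
    maps = [l2_normalize(m) for m in map_feats]
    n = len(views)

    # sim[i][j]: temperature-scaled cosine similarity of view i and map j.
    sim = [
        [sum(a * b for a, b in zip(views[i], maps[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(logits):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(logits):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(logits)

    sim_t = [list(col) for col in zip(*sim)]  # map-to-view direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0]]
    aligned_maps = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_maps = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss: ", info_nce(views, aligned_maps))
    print("shuffled loss:", info_nce(views, shuffled_maps))
```

In this toy batch the loss is small when each view is paired with its own map and large when the pairing is shuffled, which is the behaviour a contrastive objective of this kind is designed to produce.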

Abstract (original)

Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent’s egocentric views and semantic maps (Ego$^2$-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent’s egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
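The reported SR and SPL figures can be made concrete with a short sketch. It assumes the definitions commonly used in embodied navigation benchmarks: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest-path length to the length of the path the agent actually took. The function names and the episode values below are hypothetical and are not taken from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (entries are 0 or 1)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    For each episode i: S_i * l_i / max(p_i, l_i), where S_i is the success
    flag, l_i the shortest-path length to the goal, and p_i the length of the
    path the agent travelled. The result is the mean over all episodes.
    Assumes every shortest-path length is positive.
    """
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        total += s * shortest / max(taken, shortest)
    return total / len(successes)


if __name__ == "__main__":
    # Hypothetical episodes: three successes (one with a detour), one failure.
    successes = [1, 1, 0, 1]
    shortest = [10.0, 8.0, 5.0, 4.0]
    taken = [10.0, 16.0, 7.0, 4.0]
    print("SR: ", success_rate(successes))
    print("SPL:", spl(successes, shortest, taken))
```

Because SPL discounts successes reached by inefficient paths, it can never exceed SR, which is consistent with the abstract reporting a lower SPL (41%) than SR (47%).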

arXiv information

Authors	Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, Hao Tan
Published	2023-07-23 14:01:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
