GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

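Summary

Most existing work on the Room-to-Room VLN problem uses only RGB images and ignores the local context around candidate views, so it lacks sufficient visual cues about the surrounding environment.
Moreover, natural language carries complex semantic information, which makes its correlations with visual inputs hard to model with cross-attention alone.
This paper proposes GeoVLN, which learns a geometry-enhanced visual representation based on slot attention for robust vision-and-language navigation.
As visual input, the RGB images are complemented with the corresponding depth maps and normal maps predicted by Omnidata.
Technically, a two-stage module combines local slot attention with a CLIP model to produce a geometry-enhanced representation from this input.
V&L BERT is used to learn a cross-modal representation that incorporates both language and visual information.
In addition, a novel multiway attention module is designed to encourage different phrases of the input instruction to exploit the most relevant features of the visual input.
Extensive experiments demonstrate the effectiveness of the newly designed modules and the compelling performance of the proposed method.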
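
For readers unfamiliar with slot attention, the following is a minimal sketch of one generic slot-attention update in NumPy. It is not the GeoVLN module: the abstract does not specify the local slot attention in detail, so the function name `slot_attention_step`, the shapes, and the random data are all assumptions made for illustration, and the learned projections, GRU, and MLP of full slot attention are deliberately omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified slot-attention update (illustrative, hypothetical helper).

    slots:  (K, D) array of current slot vectors
    inputs: (N, D) array of input features (e.g. features of nearby views)
    Returns the updated slots (K, D) and the attention map (N, K).
    """
    d = slots.shape[1]
    logits = inputs @ slots.T / np.sqrt(d)  # (N, K) scaled dot products
    # Softmax over the slot axis: slots compete for each input.
    attn = softmax(logits, axis=1)
    # Normalize over inputs so each slot takes a weighted mean of the inputs.
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
    updates = weights.T @ inputs  # (K, D)
    return updates, attn


# Assumed toy data: 9 input features and 3 slots of dimension 16.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(9, 16))
slots = rng.normal(size=(3, 16))
for _ in range(3):  # a few refinement iterations
    slots, attn = slot_attention_step(slots, inputs)
```

The distinguishing choice compared with ordinary cross-attention is the softmax over the slot axis rather than the input axis, which makes each input's attention sum to one across slots.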

Summary (original)

Most existing works solving Room-to-Room VLN problem only utilize RGB images and do not consider local context around candidate views, which lack sufficient visual cues about surrounding environment. Moreover, natural language contains complex semantic information thus its correlations with visual inputs are hard to model merely with cross attention. In this paper, we propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation. The RGB images are compensated with the corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combine local slot attention and CLIP model to produce geometry-enhanced representation from such input. We employ V&L BERT to learn a cross-modal representation that incorporate both language and vision informations. Additionally, a novel multiway attention module is designed, encouraging different phrases of input instruction to exploit the most related features from visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method.

arXiv information

Authors	Jingyang Huo, Qiang Sun, Boyan Jiang, Haitao Lin, Yanwei Fu
Published	2023-05-26 17:15:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
