Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

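Abstract

Navigating unseen environments from natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN).
Existing approaches rely mainly on RGB images to represent the environment. They underutilize latent textual semantic and spatial cues, and they leave unresolved the modality gap between instructions and scarce environmental representations.
Intuitively, humans ground semantic knowledge in spatial layouts when navigating indoors.
Inspired by this, we propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture that encourages agents to ground the environment from diverse perspectives.
SUSA includes a Textual Semantic Understanding (TSU) module, which narrows the modality gap between instructions and environments by generating and associating descriptions of environmental landmarks in the agent's immediate surroundings.
In addition, a Depth-enhanced Spatial Perception (DSP) module incrementally constructs a depth exploration map, enabling a more nuanced understanding of environmental layouts.
Experiments show that SUSA's hybrid semantic-spatial representations effectively improve navigation performance, setting a new state of the art on three VLN benchmarks (REVERIE, R2R, and SOON).
The source code will be made publicly available.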

Abstract (original)

Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). Existing approaches primarily rely on RGB images for environmental representation, underutilizing latent textual semantic and spatial cues and leaving the modality gap between instructions and scarce environmental representations unresolved. Intuitively, humans inherently ground semantic knowledge within spatial layouts during indoor navigation. Inspired by this, we propose a versatile Semantic Understanding and Spatial Awareness (SUSA) architecture to encourage agents to ground environment from diverse perspectives. SUSA includes a Textual Semantic Understanding (TSU) module, which narrows the modality gap between instructions and environments by generating and associating the descriptions of environmental landmarks in agent’s immediate surroundings. Additionally, a Depth-enhanced Spatial Perception (DSP) module incrementally constructs a depth exploration map, enabling a more nuanced comprehension of environmental layouts. Experiments demonstrate that SUSA’s hybrid semantic-spatial representations effectively enhance navigation performance, setting new state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and SOON). The source code will be publicly available.

arXiv information

Authors	Xuesong Zhang, Yunbo Xu, Jia Li, Zhenzhen Hu, Richang Hong
Published	2025-04-07 13:57:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
