Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation

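Abstract

Recent advances in large language models (LLMs) and vision-language models (VLMs) have made them powerful tools for embodied navigation, allowing agents to draw on commonsense and spatial reasoning to explore unfamiliar environments efficiently.
Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation.
This improves efficiency and reduces redundant exploration, but the geometric information lost in language-based representations hinders spatial reasoning, especially in complex environments.
To address this, VLM-based approaches process egocentric visual input directly to select the best direction for exploration.
However, relying only on a first-person view turns navigation into a partially observed decision-making problem, which leads to suboptimal decisions in complex environments.
This paper introduces a new VLM-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations.
By dynamically aligning global contextual information with local perception, the approach strengthens spatial reasoning and decision-making in long-horizon tasks.
Experimental results show that the proposed method outperforms previous state-of-the-art approaches on object navigation tasks, offering a more effective and scalable solution for embodied navigation.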

Abstract (original)

Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel vision-language model (VLM)-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent’s egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks, providing a more effective and scalable solution for embodied navigation.
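
The abstract does not give implementation details, so the following is only a minimal sketch of what "retrieving task-relevant cues from a global memory and expressing them relative to the agent" could look like in 2D. It is not the authors' method. The `Landmark` record, the `retrieve_cues` function, the label-based relevance filter, and the `max_range` cutoff are all hypothetical names and choices introduced here for illustration. The only standard technique used is a rigid 2D transform from the global map frame into the agent's frame.

```python
import math
from dataclasses import dataclass


@dataclass
class Landmark:
    """Hypothetical global-memory entry: a labelled point in the map frame."""
    label: str
    x: float  # metres, global frame
    y: float  # metres, global frame


def global_to_ego(lm, agent_x, agent_y, agent_yaw):
    """Return (forward, left) coordinates of a landmark in the agent frame.

    agent_yaw is the agent heading in radians, measured counter-clockwise
    from the global +x axis.
    """
    dx, dy = lm.x - agent_x, lm.y - agent_y
    cos_t, sin_t = math.cos(agent_yaw), math.sin(agent_yaw)
    forward = cos_t * dx + sin_t * dy   # component along the heading
    left = -sin_t * dx + cos_t * dy     # component to the agent's left
    return forward, left


def retrieve_cues(memory, relevant_labels, agent_x, agent_y, agent_yaw,
                  max_range=10.0):
    """Keep landmarks with a relevant label within max_range.

    Returns (label, distance_m, bearing_deg) tuples sorted by distance.
    Bearing is 0 straight ahead and positive to the left. The label filter
    is an assumed stand-in for whatever relevance test a real system uses.
    """
    cues = []
    for lm in memory:
        if lm.label not in relevant_labels:
            continue
        forward, left = global_to_ego(lm, agent_x, agent_y, agent_yaw)
        dist = math.hypot(forward, left)
        if dist <= max_range:
            bearing = math.degrees(math.atan2(left, forward))
            cues.append((lm.label, round(dist, 2), round(bearing, 1)))
    return sorted(cues, key=lambda c: c[1])


if __name__ == "__main__":
    memory = [
        Landmark("sofa", 0.0, 2.0),
        Landmark("door", -3.0, 0.0),
        Landmark("plant", 50.0, 50.0),   # relevant label, but out of range
        Landmark("lamp", 1.0, 1.0),      # in range, but not task-relevant
    ]
    # Agent at the origin, facing the global +y direction.
    for cue in retrieve_cues(memory, {"sofa", "door", "plant"},
                             0.0, 0.0, math.pi / 2):
        print(cue)
```

In a pipeline of the kind the abstract describes, such ego-relative cues would presumably be combined with the agent's current view before being passed to the VLM. How that integration is done is not stated in the abstract and is left open here.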

arXiv information

Authors	Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, Haoping Xu, Guowei Huang, Zhanpeng Zhang, Tongtong Cao, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang, Yingxue Zhang
Published	2025-02-20 04:41:40+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
