TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

Summary

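The Zero-Shot Object Navigation (ZSON) task requires an embodied agent to find a previously unseen object by navigating an unfamiliar environment.
Such goal-oriented exploration depends heavily on the ability to perceive, understand, and reason about the spatial information of the environment.
However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, which loses spatial information.
In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on a top-view map carrying sufficient spatial information.
To fully unlock the MLLM's spatial reasoning potential from the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method, which adaptively constructs a semantically rich top-view map.
This lets the agent use the spatial information contained in the top-view map directly to reason thoroughly.
We also design a Dynamic Map Scaling (DMS) mechanism that dynamically zooms the top-view map to preferred scales, enhancing local fine-grained reasoning.
In addition, we devise a Potential Target Driven (PTD) mechanism that predicts and uses potential target locations, facilitating global, human-like exploration.
Experiments on the MP3D and HM3D datasets demonstrate the superiority of TopV-Nav.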

Summary (original)

The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such a goal-oriented exploration heavily relies on the ability to perceive, understand, and reason based on the spatial information of the environment. However, current LLM-based approaches convert visual observations to language descriptions and reason in the linguistic space, leading to the loss of spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information. To fully unlock the MLLM’s spatial reasoning potential in top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct semantically-rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to conduct thorough reasoning. Besides, we design a Dynamic Map Scaling (DMS) mechanism to dynamically zoom top-view map at preferred scales, enhancing local fine-grained reasoning. Additionally, we devise a Potential Target Driven (PTD) mechanism to predict and to utilize target locations, facilitating global and human-like exploration. Experiments on MP3D and HM3D datasets demonstrate the superiority of our TopV-Nav.

arXiv information

Authors	Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Huimin Ma, Shifeng Zhang, Xu Zhou, Si Liu
Published	2025-03-26 07:26:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
