WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation

Summary

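Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI.
Recent Vision-Language Model (VLM)-based agents have shown promising perception and decision-making abilities through prompting, but none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world.
We introduce WMNav, a novel world-model-based navigation framework powered by Vision-Language Models (VLMs).
It predicts the possible outcomes of decisions and builds memories that provide feedback to the policy module.
To retain the predicted state of the environment, WMNav proposes a Curiosity Value Map, maintained online as part of the world model memory, which provides dynamic configuration for the navigation policy.
By decomposing the task along the lines of a human-like thinking process, WMNav alleviates the impact of model hallucination: it makes decisions based on the difference between the world model's plan and the observed feedback.
To further boost efficiency, it uses a two-stage action proposer strategy: broad exploration followed by precise localization.
Extensive evaluation on HM3D and MP3D shows that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D).
Project page: https://b0b8k1ng.github.io/WMNav/.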

Summary (original)

Object Goal Navigation, requiring an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
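
The abstract reports its results as SR and SPL without defining them. As a rough illustration, assuming the definitions commonly used in embodied navigation benchmarks (SR as the fraction of successful episodes, SPL as Success weighted by Path Length), the two metrics could be computed as in the sketch below. The `Episode` record, the function names, and the sample numbers are hypothetical and are not taken from the WMNav code or its results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical record, for illustration only)."""
    success: bool          # True if the agent stopped close enough to the goal
    shortest_path: float   # shortest-path length from start to goal, assumed > 0
    agent_path: float      # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by (shortest path / path actually taken)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success:
            # The ratio is capped at 1 by taking the max in the denominator.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    # Failed episodes contribute 0 but still count in the average.
    return total / len(episodes)


# Made-up episodes: two inefficient successes, one optimal success, one failure.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=True, shortest_path=4.0, agent_path=4.0),
    Episode(success=False, shortest_path=6.0, agent_path=12.0),
    Episode(success=True, shortest_path=3.0, agent_path=6.0),
]
print(f"SR  = {success_rate(episodes):.2f}")
print(f"SPL = {spl(episodes):.2f}")
```

Under these definitions SPL can never exceed SR, so a gain in SPL reflects shorter successful paths as well as more successes, which is why the abstract treats it as a measure of exploration efficiency.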

arXiv information

Authors	Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen
Published	2025-04-16 13:23:05+00:00
arXiv site	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
