Zero-shot Object Navigation with Vision-Language Models Reasoning

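Summary

Object navigation is an important capability for robots, but conventional methods require large amounts of training data and do not generalize to unknown environments.
Zero-shot object navigation (ZSON) aims to address this challenge by letting robots interact with unknown objects without task-specific training data.
Language-driven zero-shot object navigation (L-ZSON) extends ZSON by incorporating natural language instructions that guide the robot's navigation and its interaction with objects.
This paper proposes a new vision-language model with a Tree-of-Thought network (VLTNet) for L-ZSON.
VLTNet consists of four main modules: vision-language model understanding, semantic mapping, Tree-of-Thought reasoning and exploration, and goal identification.
Among these, the Tree-of-Thought (ToT) reasoning and exploration module is the core component; it applies the ToT reasoning framework to the selection of navigation frontiers during robot exploration.
Compared with conventional frontier selection that involves no reasoning, navigation with ToT reasoning uses multi-path reasoning and backtracks when necessary, which enables more accurate decisions informed by global information.
Experimental results on the PASTURE and RoboTHOR benchmarks show that the model performs strongly on L-ZSON, particularly in scenarios where the target instruction is given in complex natural language.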
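
The idea of choosing a frontier through several reasoning paths, with backtracking when a choice fails, can be illustrated with a toy sketch. This is not VLTNet's implementation: the `Reasoner` type, the two toy reasoners, the `HINTS` table, and the `goal_visible` callback are all hypothetical stand-ins (in practice each reasoning path would be a language-model query). A full Tree-of-Thought search would also expand and prune intermediate thoughts, which this sketch leaves out.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical stand-in for one reasoning path (in practice, an LLM prompt):
# maps (goal text, object labels seen near a frontier) to a score in [0, 1].
Reasoner = Callable[[str, Sequence[str]], float]


def rank_frontiers(
    goal: str, frontiers: Dict[str, List[str]], reasoners: Sequence[Reasoner]
) -> List[str]:
    """Order frontier names by their mean score across all reasoning paths."""
    scores = {
        name: mean([reason(goal, objects) for reason in reasoners])
        for name, objects in frontiers.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def explore(
    goal: str,
    frontiers: Dict[str, List[str]],
    reasoners: Sequence[Reasoner],
    goal_visible: Callable[[str], bool],
) -> Optional[str]:
    """Visit frontiers best-first and return the one where the goal is found."""
    for name in rank_frontiers(goal, frontiers, reasoners):
        if goal_visible(name):
            return name
        # Dead end: backtrack and fall through to the next-ranked frontier.
    return None


def label_match(goal: str, objects: Sequence[str]) -> float:
    """Toy reasoner: 1.0 if any observed label appears in the goal text."""
    return 1.0 if any(label in goal for label in objects) else 0.0


# Invented co-occurrence hints, for illustration only.
HINTS = {"apple": {"fridge", "counter"}, "pillow": {"bed", "sofa"}}


def context_match(goal: str, objects: Sequence[str]) -> float:
    """Toy reasoner: fraction of observed labels that co-occur with the goal."""
    related = set()
    for keyword, labels in HINTS.items():
        if keyword in goal:
            related |= labels
    return len(related & set(objects)) / len(objects) if objects else 0.0


frontiers = {
    "hallway": ["door"],
    "kitchen": ["fridge", "counter"],
    "bedroom": ["bed", "lamp"],
}
goal = "a red apple on the counter"
reasoners = [label_match, context_match]
ranking = rank_frontiers(goal, frontiers, reasoners)
found = explore(goal, frontiers, reasoners, goal_visible=lambda name: name == "bedroom")
```

With these toy inputs both reasoners favor the kitchen, so it is ranked first. Because the hypothetical `goal_visible` check succeeds only in the bedroom, `explore` still returns `"bedroom"` after the higher-ranked frontiers turn out to be dead ends.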

Abstract (original)

Object navigation is crucial for robots, but traditional methods require substantial training data and cannot be generalized to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, the Tree-of-Thought (ToT) reasoning and exploration module serves as a core component, innovatively using the ToT reasoning framework for navigation frontier selection during robot exploration. Compared to conventional frontier selection without reasoning, navigation using ToT reasoning involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on PASTURE and RoboTHOR benchmarks demonstrate the outstanding performance of our model in L-ZSON, particularly in scenarios involving complex natural language as target instructions.

arXiv information

Authors	Congcong Wen, Yisiyuan Huang, Hao Huang, Yanjia Huang, Shuaihang Yuan, Yu Hao, Hui Lin, Yu-Shen Liu, Yi Fang
Published	2024-10-24 09:24:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
