Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

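Summary

LLM-based agents have demonstrated impressive zero-shot performance on the vision-language navigation (VLN) task.
However, these zero-shot methods address only high-level task planning, selecting nodes to move to in a predefined navigation graph, and they overlook the low-level control required in realistic navigation scenarios.
To bridge this gap, we propose AO-Planner, a novel affordances-oriented planning framework for the continuous VLN task.
AO-Planner integrates several foundation models to achieve affordances-oriented motion planning and action decision-making, both performed in a zero-shot manner.
Specifically, we employ a visual affordances prompting (VAP) approach in which SAM segments the visible ground to provide navigational affordances. Based on these affordances, the LLM selects potential next waypoints and generates a low-level path plan towards the selected waypoints.
We further introduce a high-level agent, PathAgent, which identifies the most probable pixel-based path and converts it into 3D coordinates to carry out the low-level motion.
Experimental results on the challenging R2R-CE benchmark show that AO-Planner achieves state-of-the-art zero-shot performance (a 5.5% improvement in SPL).
Our method establishes an effective connection between the LLM and the 3D world, which circumvents the difficulty of directly predicting world coordinates and opens new prospects for using foundation models in low-level motion control.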
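
The abstract does not say how a pixel-based path is converted into 3D coordinates. One common way to do such a conversion, shown here only as a minimal sketch and not as the authors' method, is pinhole back-projection with a depth map. The function name `backproject_path`, the depth map, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]               # (u, v): column, row in the image
Point3D = Tuple[float, float, float]  # (x, y, z) in the camera frame, metres


def backproject_path(
    path: List[Pixel],
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift a pixel path to camera-frame 3D points with the pinhole model.

    Assumption: depth[v][u] holds the distance along the optical axis.
    """
    points: List[Point3D] = []
    for u, v in path:
        z = depth[v][u]
        if z <= 0:  # skip pixels with no valid depth reading
            continue
        x = (u - cx) * z / fx  # horizontal offset from the optical axis
        y = (v - cy) * z / fy  # vertical offset from the optical axis
        points.append((x, y, z))
    return points


# Hypothetical 4x4 depth map (every pixel 2 m away) and toy intrinsics.
depth_map = [[2.0] * 4 for _ in range(4)]
waypoints = backproject_path(
    [(2, 2), (3, 1)], depth_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0
)
print(waypoints)
```

The resulting points are in the camera frame. Turning them into world coordinates would additionally require the agent's pose, which this sketch leaves out.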

Abstract (original)

LLM-based agents have demonstrated impressive zero-shot performance in the vision-language navigation (VLN) task. However, these zero-shot methods focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in realistic navigation scenarios. To bridge this gap, we propose AO-Planner, a novel affordances-oriented planning framework for continuous VLN task. Our AO-Planner integrates various foundation models to achieve affordances-oriented motion planning and action decision-making, both performed in a zero-shot manner. Specifically, we employ a visual affordances prompting (VAP) approach, where visible ground is segmented utilizing SAM to provide navigational affordances, based on which the LLM selects potential next waypoints and generates low-level path planning towards selected waypoints. We further introduce a high-level agent, PathAgent, to identify the most probable pixel-based path and convert it into 3D coordinates to fulfill low-level motion. Experimental results on the challenging R2R-CE benchmark demonstrate that AO-Planner achieves state-of-the-art zero-shot performance (5.5% improvement in SPL). Our method establishes an effective connection between LLM and 3D world to circumvent the difficulty of directly predicting world coordinates, presenting novel prospects for employing foundation models in low-level motion control.

arXiv information

Authors	Jiaqi Chen, Bingqian Lin, Xinmin Liu, Xiaodan Liang, Kwan-Yee K. Wong
Published	2024-07-08 12:52:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
