Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

Summary

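LLM-based agents have demonstrated impressive zero-shot performance on vision-language navigation (VLN) tasks.
However, existing LLM-based methods often focus only on high-level task planning, selecting nodes in a predefined navigation graph for movement, and overlook low-level control in navigation scenarios.
To bridge this gap, we propose AO-Planner, a novel affordances-oriented planner for continuous VLN tasks.
AO-Planner integrates several foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both in a zero-shot setting.
Specifically, we employ a Visual Affordances Prompting (VAP) approach: SAM segments the visible ground to provide navigational affordances, and based on these the LLM selects candidate waypoints and plans low-level paths toward the selected waypoints.
We further propose a high-level PathAgent, which marks the planned paths on the image input and reasons about the most probable path by taking all environmental information into account.
Finally, we convert the selected path into 3D coordinates using the camera intrinsic parameters and depth information, avoiding 3D predictions that are challenging for LLMs.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (an 8.8% improvement in SPL).
Our method can also serve as a data annotator that produces pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor.
This new predictor requires no waypoint data from the simulator and achieves 47% SR, competitive with supervised methods.
We establish an effective connection between LLMs and the 3D world, presenting new prospects for employing foundation models in low-level motion control.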
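
The final conversion step, lifting a path selected in the image into 3D using camera intrinsics and depth, can be sketched with a standard pinhole back-projection. This is a minimal illustration and not the authors' implementation: the function name `backproject_path`, the parameter names `fx`, `fy`, `cx`, `cy`, and the camera-frame convention (x right, y down, z forward, with depth measured along the optical axis) are all assumptions.

```python
import numpy as np

def backproject_path(pixels, depth, fx, fy, cx, cy):
    """Lift (u, v) pixel waypoints to 3D camera-frame points (hypothetical helper).

    Assumes a pinhole camera with focal lengths fx, fy and principal point
    (cx, cy), and a depth map holding z-distance along the optical axis.
    """
    pixels = np.asarray(pixels, dtype=int)
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v, u].astype(float)   # depth maps are indexed [row, col] = [v, u]
    x = (u - cx) * z / fx           # horizontal offset from the optical axis
    y = (v - cy) * z / fy           # vertical offset from the optical axis
    return np.stack([x, y, z], axis=1)

# Toy example: a flat depth map 2 m away and two pixel waypoints.
depth_map = np.full((4, 4), 2.0)
path_3d = backproject_path([(2, 2), (3, 2)], depth_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because the LLM only has to pick pixels, all metric 3D reasoning is delegated to this kind of geometric step. A real system would additionally transform the camera-frame points into the agent or world frame, which is omitted here.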

Abstract (original)

LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, existing LLM-based methods often focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for continuous VLN tasks. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards selected waypoints. We further propose a high-level PathAgent which marks planned paths into the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.
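
The abstract reports results in SR and SPL without defining them. As commonly used in navigation benchmarks, SR (success rate) is the fraction of successful episodes, and SPL (success weighted by path length) scales each success by the ratio of the shortest-path length to the length of the path actually taken. The sketch below illustrates those definitions on made-up numbers; the function names and episode values are hypothetical and are not taken from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    Each episode contributes S * l / max(p, l), where S is the success flag,
    l the shortest-path length and p the length of the path the agent took.
    Assumes every shortest-path length is greater than zero.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Made-up episodes: one optimal success, one failure, one success with a 2x detour.
flags = [1, 0, 1]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 5.0, 20.0]
sr_value = success_rate(flags)
spl_value = spl(flags, shortest, taken)
```

In this toy case SPL is lower than SR because the third episode succeeds only after a detour, which is why SPL is treated as the stricter measure of navigation efficiency.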

arXiv information

Authors	Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, Kwan-Yee K. Wong
Published	2024-08-20 14:51:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
