LangNav: Language as a Perceptual Representation for Navigation

Summary

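We explore the use of language as a perceptual representation for vision-and-language navigation.
Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert the agent's egocentric panoramic view at each time step into natural language descriptions.
We then finetune a pretrained language model to select, based on the current view and the trajectory history, the action that best fulfills the navigation instructions.
In contrast to the standard setup, which adapts a pretrained language model to operate directly on continuous visual features from a pretrained vision model, our approach uses (discrete) language as the perceptual representation.
We examine two use cases of the language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark. The first is generating synthetic trajectories from a prompted large language model (GPT-4), which are then used to finetune a smaller language model.
The second is sim-to-real transfer, in which a policy learned in a simulated environment (ALFRED) is transferred to a real-world environment (R2R).
The approach improves on strong baselines that rely on visual features in settings where only a few gold trajectories (10 to 100) are available, demonstrating the potential of language as a perceptual representation for navigation tasks.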

Abstract (original)

We explore the use of language as a perceptual representation for vision-and-language navigation. Our approach uses off-the-shelf vision systems (for image captioning and object detection) to convert an agent’s egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore two use cases of our language-based navigation (LangNav) approach on the R2R vision-and-language navigation benchmark: generating synthetic trajectories from a prompted large language model (GPT-4) with which to finetune a smaller language model; and sim-to-real transfer where we transfer a policy learned on a simulated environment (ALFRED) to a real-world environment (R2R). Our approach is found to improve upon strong baselines that rely on visual features in settings where only a few gold trajectories (10-100) are available, demonstrating the potential of using language as a perceptual representation for navigation tasks.
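
As a rough illustration of the idea in the abstract, the sketch below turns per-direction captions and detected objects into a text observation and assembles a prompt from which a language model could choose the next action. It is not the authors' implementation: `ViewDescription`, `describe_panorama`, `build_prompt`, and the prompt wording are all hypothetical, and the captions and object lists are assumed to come from external captioning and detection models that are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class ViewDescription:
    """Text description of one direction of the panorama (hypothetical structure)."""
    heading: str   # e.g. "front", "left"
    caption: str   # assumed output of an image captioning model
    objects: list[str] = field(default_factory=list)  # assumed output of an object detector


def describe_panorama(views: list[ViewDescription]) -> str:
    """Join the per-direction descriptions into one text observation."""
    lines = []
    for view in views:
        objs = ", ".join(view.objects) if view.objects else "none detected"
        lines.append(f"To your {view.heading}: {view.caption} Objects: {objs}.")
    return "\n".join(lines)


def build_prompt(instruction: str, history: list[str], views: list[ViewDescription]) -> str:
    """Assemble instruction, trajectory history, current view, and candidate actions."""
    if history:
        steps = "\n".join(f"Step {i}: {h}" for i, h in enumerate(history, start=1))
    else:
        steps = "(no steps yet)"
    # One candidate action per described direction (an illustrative simplification).
    options = "\n".join(f"({i}) go {view.heading}" for i, view in enumerate(views))
    return (
        f"Instruction: {instruction}\n"
        f"History:\n{steps}\n"
        f"Current view:\n{describe_panorama(views)}\n"
        f"Candidate actions:\n{options}\n"
        "Answer with the number of the best action:"
    )


if __name__ == "__main__":
    views = [
        ViewDescription("front", "A hallway with a wooden door.", ["door", "lamp"]),
        ViewDescription("left", "A kitchen with a long counter.", ["sink"]),
    ]
    print(build_prompt("Walk to the kitchen and stop by the sink.", [], views))
```

The resulting string would be passed to a finetuned language model, whose output selects one of the listed actions; that model call is deliberately left out because the abstract does not specify it.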

arXiv information

Authors	Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, Yoon Kim
Published	2023-10-11 20:52:30+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
