NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Summary

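Vision-and-Language Navigation (VLN) is a key research problem in Embodied AI that aims to enable agents to navigate unseen environments by following linguistic instructions. Generalization, whether to out-of-distribution scenes or from Sim to Real, is a long-standing challenge in this field. To mitigate this generalization gap, this paper proposes NaVid, a video-based large vision language model (VLM). NaVid is the first attempt to show that a VLM can achieve state-of-the-art level navigation performance without maps, an odometer, or depth inputs. Following a human instruction, NaVid outputs the next-step action from nothing more than an on-the-fly video stream from a monocular RGB camera mounted on the robot. This also removes the problems caused by odometer noise and by the Sim2Real gaps of map or depth inputs. Moreover, our video-based approach effectively encodes the robot's past observations as spatio-temporal context for decision-making and instruction following. NaVid is trained on 550k navigation samples collected from VLN-CE trajectories together with 665k samples of large-scale web data. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and in the real world, demonstrating superior cross-dataset and Sim2Real transfer. We therefore believe that the proposed VLM approach plans the next step not only for navigation agents but also for this research field.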
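
The interface described above (one instruction plus a running stream of monocular RGB frames in, one next-step action out, with no map, odometer, or depth input) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VideoNavAgent`, `run_episode`, and `toy_policy`, the four-action set, and the bounded frame history are all hypothetical choices made for illustration, and the toy policy merely stands in for the VLM.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence

Frame = bytes  # placeholder type for one RGB image (assumption)

# Assumed discrete action set; the summary above does not list the actions.
ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP")

# A policy maps (instruction, frame history) to an action name.
Policy = Callable[[str, Sequence[Frame]], str]


class VideoNavAgent:
    """Hypothetical wrapper: keeps the frame history and queries a policy."""

    def __init__(self, policy: Policy, max_frames: int = 32) -> None:
        self.policy = policy
        # Bounding the history is an illustrative choice, not the paper's.
        self.history: Deque[Frame] = deque(maxlen=max_frames)

    def step(self, instruction: str, frame: Frame) -> str:
        # The only sensor input is the newest RGB frame.
        self.history.append(frame)
        action = self.policy(instruction, list(self.history))
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        return action


def run_episode(agent: VideoNavAgent, instruction: str,
                camera: Iterable[Frame], max_steps: int = 100) -> List[str]:
    """Feed frames one by one until the policy says STOP or steps run out."""
    actions: List[str] = []
    for frame in camera:
        action = agent.step(instruction, frame)
        actions.append(action)
        if action == "STOP" or len(actions) >= max_steps:
            break
    return actions


def toy_policy(instruction: str, frames: Sequence[Frame]) -> str:
    # Stand-in for the VLM: go forward until three frames were seen, then stop.
    return "STOP" if len(frames) >= 3 else "FORWARD"


agent = VideoNavAgent(policy=toy_policy)
episode = run_episode(agent, "walk to the kitchen", [b"f1", b"f2", b"f3", b"f4"])
```

In this sketch the past frames are the agent's only memory, which mirrors the idea of using historical observations as spatio-temporal context; a real system would replace `toy_policy` with a call to the trained model.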

Abstract (original)

Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometer and depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision-making and instruction following. We train NaVid with 550k navigation samples collected from VLN-CE trajectories, including action-planning and instruction-reasoning samples, along with 665k large-scale web data. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

arXiv information

Authors	Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, Wang He
Published	2024-03-01 05:09:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
