NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

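Summary

Vision-and-Language Navigation (VLN) is a key research problem in Embodied AI that aims to enable agents to navigate unseen environments by following language instructions.
In this field, generalization is a long-standing challenge, whether to out-of-distribution scenes or from simulation to the real world.
In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate this generalization gap.
NaVid is the first attempt to show that a VLM can reach state-of-the-art-level navigation performance without any map, odometer, or depth input.
Given a human instruction, NaVid needs only an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action.
Our formulation mimics how humans navigate and naturally avoids the problems caused by odometer noise and by the Sim2Real gap of map or depth inputs.
In addition, our video-based approach can effectively encode the robot's historical observations as spatio-temporal context for decision-making and instruction following.
We train NaVid on 550k navigation samples collected from VLN-CE trajectories (including action-planning and instruction-reasoning samples) together with 665k samples of large-scale web data.
Extensive experiments show that NaVid achieves SOTA performance in both simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
We therefore believe that the proposed VLM approach plans the next step not only for navigation agents but also for this research field.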
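
As a rough illustration of the formulation described above (an instruction plus the RGB frames seen so far go in, one next-step action comes out, with no map, odometer, or depth input), the control loop might be sketched as follows. This is not the authors' implementation: `VideoNavAgent`, `run_episode`, `toy_policy`, and the four action names are hypothetical, and the toy policy is only a stand-in for the video-based VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence

Frame = bytes  # placeholder type for one monocular RGB image
# A policy maps (instruction, all frames so far) to an action name.
Policy = Callable[[str, Sequence[Frame]], str]

# Assumed discrete action set; the summary above does not list the actions.
ACTIONS = ("forward", "turn_left", "turn_right", "stop")


@dataclass
class VideoNavAgent:
    """Keeps the frame history and queries a policy for the next action."""
    policy: Policy
    instruction: str
    history: List[Frame] = field(default_factory=list)

    def step(self, frame: Frame) -> str:
        # The only observation is the new RGB frame; past frames are the context.
        self.history.append(frame)
        action = self.policy(self.instruction, self.history)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        return action


def run_episode(agent: VideoNavAgent, camera: Iterable[Frame],
                max_steps: int = 100) -> List[str]:
    """Feed frames to the agent until it outputs 'stop' or max_steps is hit."""
    actions: List[str] = []
    for frame in camera:
        if len(actions) >= max_steps:
            break
        action = agent.step(frame)
        actions.append(action)
        if action == "stop":
            break
    return actions


def toy_policy(instruction: str, frames: Sequence[Frame]) -> str:
    # Hypothetical stand-in for the VLM: move forward twice, then stop.
    return "forward" if len(frames) < 3 else "stop"


agent = VideoNavAgent(policy=toy_policy, instruction="walk to the door")
result = run_episode(agent, [b"f0", b"f1", b"f2", b"f3"])
print(result)
```

In a real system the policy would be the trained model and the camera would be a live stream; the point of the sketch is only that the frame history is the sole state carried between steps.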

Summary (original)

Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometer and depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision-making and instruction following. We train NaVid with 550k navigation samples collected from VLN-CE trajectories, including action-planning and instruction-reasoning samples, along with 665k large-scale web data. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

arXiv information

Authors	Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, Wang He
Published	2024-02-24 16:39:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
