Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Abstract

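A practical navigation agent must handle a wide range of interaction demands, such as following instructions, searching for objects, answering questions, and tracking people.
Existing embodied navigation models are often constrained by specific task configurations or by pre-defined maps with discretized waypoints, so they fall short of serving as practical generalists in the real world.
This work presents Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and to enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.
Uni-NaVid achieves this by harmonizing the input and output data configurations of all commonly used embodied navigation tasks, thereby integrating all tasks into one model.
To train Uni-NaVid, the authors collect a total of 3.6 million navigation data samples from four essential navigation sub-tasks and foster synergy in learning across them.
Extensive experiments on comprehensive navigation benchmarks demonstrate the advantages of unified modeling in Uni-NaVid and show that it achieves state-of-the-art performance.
In addition, real-world experiments confirm the model's effectiveness and efficiency and highlight its strong generalizability.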

Abstract (original)

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model’s effectiveness and efficiency, shedding light on its strong generalizability.

arXiv information

Authors	Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang
Published	2025-02-06 10:14:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
