Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

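Summary

A practical navigation agent must handle a wide range of interaction demands, such as following instructions, searching for objects, answering questions, and tracking people.
Existing embodied navigation models fall short as practical generalists in the real world because they are often constrained by specific task configurations or by pre-defined maps with discretized waypoints.
This work presents Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and to enable seamless navigation for mixed long-horizon tasks in unseen real-world environments.
Uni-NaVid achieves this by harmonizing the input and output data configurations of all commonly used embodied navigation tasks, thereby integrating all tasks into one model.
To train Uni-NaVid, the authors collect a total of 3.6 million navigation data samples from four essential navigation sub-tasks and foster learning synergy across them.
Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of Uni-NaVid's unified modeling and show that it achieves state-of-the-art performance.
In addition, real-world experiments confirm the model's effectiveness and efficiency and highlight its strong generalizability.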

Summary (original)

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model’s effectiveness and efficiency, shedding light on its strong generalizability.

arXiv information

Authors	Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang
Published	2024-12-09 05:55:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
