Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

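Summary

Foundation models have transformed robotics by providing rich semantic representations without task-specific training.
Many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, but a fundamental question remains: can these pretrained embeddings alone guide navigation successfully, without additional fine-tuning or specialized modules?
We present a minimalist framework that isolates this question by training a behavior cloning policy directly on frozen vision-language embeddings, using demonstrations collected by a privileged expert.
Our approach achieves a 74% success rate in navigating to language-specified targets, compared with 100% for the state-aware expert, but requires 3.2 times as many steps on average.
This performance gap shows that pretrained embeddings effectively support basic language grounding but struggle with long-horizon planning and spatial reasoning.
By providing this empirical baseline, we highlight both the capabilities and the limitations of using foundation models as drop-in representations for embodied tasks, offering critical insights for robotics researchers facing practical design tradeoffs between system complexity and performance in resource-constrained scenarios.
Our code is available at https://github.com/oadamharoon/text2nav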

Abstract (original)

Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that decouples this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though requiring 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with long-horizon planning and spatial reasoning. By providing this empirical baseline, we highlight both the capabilities and limitations of using foundation models as drop-in representations for embodied tasks, offering critical insights for robotics researchers facing practical design tradeoffs between system complexity and performance in resource-constrained scenarios. Our code is available at https://github.com/oadamharoon/text2nav
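
To make the idea of behavior cloning on frozen embeddings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: random vectors stand in for frozen vision-language embeddings, the "privileged expert" is a hypothetical rule that reads the first three coordinates directly, the action space is assumed to be three discrete actions, and the policy is a linear softmax classifier rather than whatever network the paper uses. All function names are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_demos(n, dim=4, seed=0):
    """Hypothetical demonstrations: random 'frozen embeddings' paired with
    the action of a toy expert that picks the largest of the first 3 coords."""
    rng = random.Random(seed)
    embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    actions = [max(range(3), key=lambda k: x[k]) for x in embeddings]
    return embeddings, actions


def logits_for(weights, bias, x):
    """Linear policy head: one logit per discrete action."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def train_bc_policy(embeddings, expert_actions, num_actions, epochs=200, lr=0.5):
    """Behavior cloning: fit the policy to expert actions by minimizing
    cross-entropy with full-batch gradient descent. Embeddings stay fixed."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [[0.0] * dim for _ in range(num_actions)]
    bias = [0.0] * num_actions
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_actions)]
        grad_b = [0.0] * num_actions
        for x, a in zip(embeddings, expert_actions):
            probs = softmax(logits_for(weights, bias, x))
            for k in range(num_actions):
                err = probs[k] - (1.0 if k == a else 0.0)
                grad_b[k] += err / n
                for j in range(dim):
                    grad_w[k][j] += err * x[j] / n
        for k in range(num_actions):
            bias[k] -= lr * grad_b[k]
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j]
    return weights, bias


def act(weights, bias, x):
    """Greedy action of the cloned policy for one embedding."""
    scores = logits_for(weights, bias, x)
    return max(range(len(scores)), key=lambda k: scores[k])


def imitation_accuracy(weights, bias, embeddings, expert_actions):
    """Fraction of samples where the cloned policy matches the expert."""
    hits = sum(act(weights, bias, x) == a for x, a in zip(embeddings, expert_actions))
    return hits / len(embeddings)


if __name__ == "__main__":
    demos_x, demos_a = make_toy_demos(200)
    w, b = train_bc_policy(demos_x, demos_a, num_actions=3)
    print("agreement with expert:", imitation_accuracy(w, b, demos_x, demos_a))
```

Only the small policy head is trained here; the embeddings are never updated, which mirrors the "frozen" setting the abstract describes. Agreement with the expert on single steps is a much weaker measure than the episode-level success rate and step count the paper reports.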

arXiv information

Authors	Nitesh Subedi, Adam Haroon, Shreyan Ganguly, Samuel T. K. Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar
Published	2025-06-17 13:31:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
