ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots

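Summary

Vision-Language Navigation (VLN) has emerged as a promising paradigm that lets mobile robots perform zero-shot inference and carry out tasks without task-specific pre-programming.
However, current systems often separate map exploration from path planning, and exploration relies on inefficient algorithms because the available environmental information is limited.
This paper presents "ClipRover", a new navigation pipeline that performs exploration and target discovery simultaneously by leveraging the capabilities of a vision-language model named CLIP.
The approach requires only monocular vision and operates without any prior map or prior knowledge of the target.
For comprehensive evaluation, the authors design a functional prototype of a UGV (unmanned ground vehicle) system named "Rover Master", a customized platform for general-purpose VLN tasks.
They integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across a variety of real-world scenarios.
The experimental results show that ClipRover consistently outperforms traditional map traversal algorithms and achieves performance comparable to path-planning methods that depend on prior map and target knowledge.
Notably, ClipRover provides real-time active navigation without requiring pre-captured candidate images or pre-built node graphs, which addresses key limitations of existing VLN pipelines.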

Abstract (original)

Vision-language navigation (VLN) has emerged as a promising paradigm, enabling mobile robots to perform zero-shot inference and execute tasks without specific pre-programming. However, current systems often separate map exploration and path planning, with exploration relying on inefficient algorithms due to limited (partially observed) environmental information. In this paper, we present a novel navigation pipeline named "ClipRover" for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of a vision-language model named CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge about the target. For comprehensive evaluations, we design the functional prototype of a UGV (unmanned ground vehicle) system named "Rover Master", a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios. Experimental results demonstrate that ClipRover consistently outperforms traditional map traversal algorithms and achieves performance comparable to path-planning methods that depend on prior map and target knowledge. Notably, ClipRover offers real-time active navigation without requiring pre-captured candidate images or pre-built node graphs, addressing key limitations of existing VLN pipelines.

arXiv information

Authors	Yuxuan Zhang, Adnan Abdullah, Sanjeev J. Koppal, Md Jahidul Islam
Published	2025-02-12 21:07:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
