VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models

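Abstract

We introduce a new self-improving framework that uses Vision-Language Models (VLMs) to enhance Embodied Visual Tracking (EVT), addressing the limited ability of current active visual tracking systems to recover from tracking failure.
Our approach combines off-the-shelf active tracking methods with the reasoning capabilities of VLMs: a fast visual policy handles normal tracking, and VLM reasoning is activated only when a failure is detected.
The framework features a memory-augmented self-reflection mechanism that lets the VLM improve progressively by learning from past experience, which effectively addresses the limitations of VLMs in 3D spatial reasoning.
Experimental results show significant performance improvements: in challenging environments, the framework boosts success rates by $72\%$ with state-of-the-art RL-based approaches and by $220\%$ with PID-based methods.
This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, and it offers substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments.
Project website: https://sites.google.com/view/evt-recovery-assistant.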

Abstract (original)

We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines the off-the-shelf active tracking methods with VLMs’ reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs’ limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by $72\%$ with state-of-the-art RL-based approaches and $220\%$ with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: https://sites.google.com/view/evt-recovery-assistant.
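The control flow described in the abstract (a fast policy for normal tracking, VLM reasoning only on detected failure, and a memory of past recovery attempts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `RecoveryAssistant`, `stub_vlm`, the action names, and the dictionary-shaped observation are all hypothetical, and the stub stands in for a real VLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types: the abstract does not specify observation or action formats.
Observation = Dict[str, bool]
Action = str
Memory = List[Tuple[Action, bool]]  # (recovery action tried, whether it recovered the target)


@dataclass
class RecoveryAssistant:
    """Routes each step to the fast policy, or to the slow reasoner on failure."""
    fast_policy: Callable[[Observation], Action]
    failure_detector: Callable[[Observation], bool]
    vlm_reasoner: Callable[[Observation, Memory], Action]
    memory: Memory = field(default_factory=list)

    def step(self, obs: Observation) -> Action:
        # Normal tracking: the cheap visual policy handles the step.
        if not self.failure_detector(obs):
            return self.fast_policy(obs)
        # Failure detected: ask the reasoner, conditioned on past attempts.
        return self.vlm_reasoner(obs, list(self.memory))

    def reflect(self, action: Action, recovered: bool) -> None:
        # Record the outcome so later recovery decisions can use it.
        self.memory.append((action, recovered))


def stub_vlm(obs: Observation, memory: Memory) -> Action:
    """Stand-in for a VLM: prefer recovery actions that have not failed before."""
    failed = {action for action, ok in memory if not ok}
    for candidate in ("turn_left", "turn_right", "move_back"):
        if candidate not in failed:
            return candidate
    return "turn_left"  # every candidate has failed; fall back to the first


assistant = RecoveryAssistant(
    fast_policy=lambda obs: "follow",
    failure_detector=lambda obs: not obs["target_visible"],
    vlm_reasoner=stub_vlm,
)
```

In this toy setup, a step with the target visible is handled by the fast policy, while a step with the target lost is routed to the stub reasoner; after `reflect("turn_left", False)` records a failed attempt, the stub picks a different recovery action on the next failure.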

arXiv information

Authors	Kui Wu, Shuhang Xu, Hao Chen, Churan Wang, Zhoujun Li, Yizhou Wang, Fangwei Zhong
Published	2025-05-28 15:54:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google