TrackVLA: Embodied Visual Tracking in the Wild

Abstract

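Embodied visual tracking is a fundamental skill in Embodied AI that enables an agent to follow a specific target in dynamic environments using only egocentric vision.
The task is inherently challenging because it requires both accurate target recognition and effective trajectory planning under severe occlusion and high scene dynamics.
Existing approaches typically address this challenge by separating recognition and planning into distinct modules.
In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning.
Built on a shared LLM backbone, it uses a language modeling head for recognition and an anchor-based diffusion model for trajectory planning.
To train TrackVLA, we construct the Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples of diverse difficulty levels, yielding a dataset of 1.7 million samples.
Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability.
It significantly outperforms existing methods on public benchmarks in a zero-shot manner and remains robust to high dynamics and occlusion in real-world scenarios at an inference speed of 10 FPS.
The project page is https://pku-epic.github.io/TrackVLA-web.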

Abstract (original)

Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect diverse difficulty levels of recognition samples, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

arXiv information

Authors	Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang
Published	2025-05-29 07:28:09+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
