Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL

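Summary

Embodied visual tracking is the task of following a target object in a dynamic 3D environment using the agent's egocentric vision.
It is a vital yet challenging skill for embodied agents.
However, existing methods suffer from inefficient training and poor generalization.
In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to strengthen embodied visual tracking.
We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks from text prompts.
We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, so that it learns from collected demonstrations without any online interaction.
To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy.
In this way, a robust policy can be trained within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. We evaluate the agent in several high-fidelity environments with challenging conditions such as distractors and occlusion.
The results show that our agent outperforms state-of-the-art methods in sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets.
We also demonstrate that the learned agent transfers from virtual environments to a real-world robot.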

Summary (original)

Embodied visual tracking is to follow a target object in dynamic 3D environments using an agent’s egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as ‘Tracking Anything’, to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust policy within an hour on a consumer-level GPU, e.g., Nvidia RTX 3090. We evaluate our agent on several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned agent from virtual environments to a real-world robot.
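
The abstract names Conservative Q-Learning (CQL) as an example of the offline RL algorithm used to train the policy. As a rough illustration of what makes CQL "conservative", the sketch below computes a CQL-style loss for a batch of logged transitions. It is not the authors' implementation: it assumes a discrete action space, takes already-computed Q-values and TD targets as plain NumPy arrays, and the names cql_loss, q_values, td_targets, and alpha are hypothetical, chosen only for this example.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """CQL-style loss for a discrete action space (illustrative assumption).

    q_values:   (batch, num_actions) Q-estimates for each logged state
    actions:    (batch,) index of the action stored in the dataset
    td_targets: (batch,) precomputed Bellman targets r + gamma * max_a' Q_target
    alpha:      weight of the conservative penalty (hypothetical default)
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)
    td_targets = np.asarray(td_targets, dtype=np.float64)

    # Q-value of the action that was actually taken in the demonstrations.
    q_data = q_values[np.arange(len(actions)), actions]

    # Conservative term: push down Q over all actions (soft maximum),
    # push up Q on the dataset actions. It is always >= 0.
    conservative = np.mean(logsumexp(q_values, axis=1) - q_data)

    # Standard TD regression term.
    bellman = 0.5 * np.mean((q_data - td_targets) ** 2)

    return alpha * conservative + bellman
```

The conservative term penalizes high Q-values for actions that do not appear in the logged data, which is what lets the policy be learned from demonstrations alone without online interaction. In practice the Q-values would come from the recurrent network and the loss would be minimized with an autodiff framework; both are omitted here.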

arXiv information

Authors	Fangwei Zhong, Kui Wu, Hai Ci, Churan Wang, Hao Chen
Published	2024-07-22 06:13:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
