Grounding Video Models to Actions through Goal Conditioned Exploration

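Summary

Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.
However, video models are not grounded in an agent's embodiment, so they do not describe how to actuate the world to reach the visual states depicted in a video.
To address this, current methods use a separate vision-based inverse dynamics model, trained on embodiment-specific data, to map image states to actions.
Collecting data to train such a model is often expensive and challenging, and the model is limited to visual settings similar to those for which data are available.
This paper investigates how to ground video models directly to continuous actions through self-exploration in the embodied environment, using generated video states as visual goals for exploration.
We propose a framework that combines trajectory-level action generation with video guidance, enabling an agent to solve complex tasks without external supervision such as rewards, action labels, or segmentation masks.
We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation.
We show that our approach is on par with, or even surpasses, multiple behavior cloning baselines trained on expert demonstrations, without requiring any action annotations.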
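
The idea of learning to reach visual goals from self-collected experience, without action labels or rewards, can be illustrated with a deliberately tiny toy. The sketch below is not the paper's method: it assumes a hypothetical 1-D grid world, a plain random walk in place of video-guided exploration, a lookup table in place of a learned trajectory-level policy, and a hard-coded list of integer "frames" in place of a generated video. All names (`step`, `explore`, `relabel`, `follow`) are invented for illustration. It shows only the hindsight-relabeling step: every state reached later in a rollout is treated as a goal that the earlier actions achieved.

```python
import random

ACTIONS = (-1, 1)  # toy action set: move left or right by one cell


def step(state, action, size):
    """Toy 1-D dynamics: move one cell and clamp to [0, size - 1]."""
    return max(0, min(size - 1, state + action))


def explore(size, num_steps, seed):
    """Random-walk exploration; returns visited states and the actions taken."""
    rng = random.Random(seed)
    states, actions = [0], []
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)
        actions.append(action)
        states.append(step(states[-1], action, size))
    return states, actions


def relabel(states, actions):
    """Hindsight relabeling: treat every later state as a goal that was reached.

    For each (state, goal) pair, keep the first action of the shortest
    observed sub-trajectory that connects the two.
    """
    best = {}  # (state, goal) -> (length, first_action)
    for i, action in enumerate(actions):
        for j in range(i + 1, len(states)):
            key = (states[i], states[j])
            if key[0] != key[1] and (key not in best or j - i < best[key][0]):
                best[key] = (j - i, action)
    return {key: action for key, (_, action) in best.items()}


def follow(policy, start, frames, size):
    """Chase each frame in order as a subgoal; returns the states visited."""
    path = [start]
    for goal in frames:
        while path[-1] != goal:
            action = policy.get((path[-1], goal))
            if action is None:  # pair never seen during exploration
                return path
            path.append(step(path[-1], action, size))
    return path


SIZE = 7
states, actions = explore(SIZE, num_steps=1000, seed=0)
policy = relabel(states, actions)
frames = [2, 4, 6]  # assumed stand-in for states depicted in a generated video
path = follow(policy, start=0, frames=frames, size=SIZE)
print(path)
```

Keeping only the shortest observed connection matters in this toy: in a 1-D walk, any rollout segment that starts by moving away from the goal must later pass through its starting cell again, so a strictly shorter segment always exists and the stored action points toward the goal. Nothing here involves images, continuous actions, or a video model, which are the parts the paper actually addresses.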

Summary (original)

Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment — using generated video states as visual goals for exploration. We propose a framework that uses trajectory level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show how our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations while without requiring any action annotations.

arXiv information

Authors	Yunhao Luo, Yilun Du
Published	2025-03-12 17:03:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
