VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

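Summary

Visual imitation learning (VIL) gives robotic systems an efficient and intuitive way to acquire new skills.
Recent advances in vision language models (VLMs) have shown remarkable vision and language reasoning capabilities on VIL tasks.
Despite this progress, current VIL methods simply use VLMs to learn high-level plans from human videos and rely on pre-defined motion primitives to execute physical interactions, which remains a major bottleneck.
This work presents VLMimic, a new paradigm that uses VLMs to learn skills directly, down to fine-grained action levels, from only a limited number of human videos.
Specifically, VLMimic first grounds object-centric movements from human videos and then learns skills using hierarchical constraint representations, which makes it possible to derive skills with fine-grained action levels from limited human videos.
These skills are refined and updated through an iterative comparison strategy, enabling efficient adaptation to unseen environments.
Extensive experiments show that VLMimic, using only 5 human videos, yields significant improvements of over 27% on RLBench and over 21% on real-world manipulation tasks, and surpasses baselines by over 37% on long-horizon tasks.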

Summary (original)

Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable performance in vision and language reasoning capabilities for VIL tasks. Despite the progress, current VIL methods naively employ VLMs to learn high-level plans from human videos, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck. In this work, we present VLMimic, a novel paradigm that harnesses VLMs to directly learn even fine-grained action levels, only given a limited number of human videos. Specifically, VLMimic first grounds object-centric movements from human videos, and learns skills using hierarchical constraint representations, facilitating the derivation of skills with fine-grained action levels from limited human videos. These skills are refined and updated through an iterative comparison strategy, enabling efficient adaptation to unseen environments. Our extensive experiments exhibit that our VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% in RLBench and real-world manipulation tasks, and surpasses baselines by over 37% in long-horizon tasks.

arXiv information

Authors	Guanyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Haizhou Li, Yuan Li, Yi Yang, Yufeng Yue
Published	2024-10-29 13:03:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
