Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

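Summary

Large language models pre-trained on extensive corpora have recently achieved significant success across a range of natural language processing tasks with minimal fine-tuning.
This success offers new promise for robotics, a field long constrained by the high cost of action-labeled data.
We ask: given the abundance of video data containing interaction-related knowledge, available as a rich "corpus", can a similar generative pre-training approach be applied effectively to enhance robot learning?
The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks.
Inspired by the way humans learn new skills by observing dynamic environments, we propose that effective robot learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions.
To this end, we introduce Moto, which converts video content into latent motion token sequences with a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner.
We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge.
After pre-training, Moto-GPT shows a promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess the rationality of a trajectory through its output likelihood.
To transfer the learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.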
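
The idea of assessing a trajectory through output likelihood can be illustrated with a minimal sketch. It is not the Moto-GPT implementation: `toy_next_logits` is a hypothetical stand-in for a pre-trained autoregressive model, the three-token vocabulary is invented, and the only point is to show how summing per-step log-probabilities yields a score for a whole motion token sequence.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def toy_next_logits(prefix, vocab_size=3):
    """Hypothetical model: slightly prefers repeating the previous token.

    A real system would call a trained network here; this rule is an
    assumption made only so the example runs on its own.
    """
    logits = [0.0] * vocab_size
    if prefix:
        logits[prefix[-1]] = 2.0
    return logits

def trajectory_log_likelihood(tokens, next_logits):
    """Sum of log p(token_t | tokens before t) over the sequence."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += log_softmax(next_logits(tokens[:t]))[tok]
    return total

smooth = [1, 1, 1, 1]    # consistent motion under the toy model
erratic = [1, 0, 2, 1]   # keeps switching tokens

smooth_score = trajectory_log_likelihood(smooth, toy_next_logits)
erratic_score = trajectory_log_likelihood(erratic, toy_next_logits)
print(smooth_score > erratic_score)
```

Under this toy model the consistent sequence receives the higher score, which is the sense in which a likelihood can rank candidate trajectories.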

Summary (original)

Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich ‘corpus’, can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging ‘language’ of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.

arXiv information

Authors	Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu
Published	2025-03-21 01:45:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
