Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Summary

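This work introduces the Multimodal Diffusion Transformer (MDT), a new diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations.
MDT uses a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks conditioned on multimodal goals.
The vast majority of imitation learning methods learn from a single goal modality only, that is, either language or goal images.
However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prevents current methods from learning language-conditioned behavior from them.
MDT addresses this challenge by introducing a latent goal-conditioned state representation that is trained simultaneously on multimodal goal instructions.
This state representation aligns image-based and language-based goal embeddings and encodes enough information to predict future states.
The representation is trained with two self-supervised auxiliary objectives, which improves the performance of the presented transformer backbone.
MDT performs strongly on the 164 tasks provided by the challenging CALVIN and LIBERO benchmarks, including a LIBERO version that contains less than $2\%$ language annotations.
Furthermore, MDT sets a new record on the CALVIN manipulation challenge, with an absolute performance improvement of $15\%$ over prior state-of-the-art methods that require large-scale pretraining and contain $10\times$ more learnable parameters.
MDT demonstrates that it can solve long-horizon manipulation from sparsely annotated data in both simulated and real-world environments.
Demonstrations and code are available at https://intuitive-robots.github.io/mdt_policy/.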
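
To make the idea of aligning image-based and language-based goal embeddings more concrete, here is a minimal sketch. The abstract does not say which loss MDT uses for this alignment, so the symmetric InfoNCE (contrastive) loss below is an assumption chosen only because it is a common way to pull paired embeddings together. The function names, the `temperature` value, and the embedding shapes are all hypothetical and are not taken from the MDT code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_goal_emb, language_goal_emb, temperature=0.1):
    """Hypothetical alignment loss (an assumption, not the paper's stated objective).

    Both inputs have shape (batch, dim). Row i of each array is assumed to
    describe the same goal, once as an image embedding and once as a
    language embedding.
    """
    img = l2_normalize(image_goal_emb)
    lang = l2_normalize(language_goal_emb)
    # Pairwise cosine similarities, scaled by the temperature: shape (batch, batch).
    logits = img @ lang.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-language direction: each row should select its own column.
    loss_i2l = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Language-to-image direction: each column should select its own row.
    loss_l2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2l + loss_l2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Matching pairs should score a lower loss than deliberately shifted pairs.
    print("aligned:   ", symmetric_info_nce(emb, emb))
    print("misaligned:", symmetric_info_nce(emb, np.roll(emb, 1, axis=0)))
```

In a training setup of this kind, such a loss would only be computed on the small subset of samples that have both a goal image and a language annotation, which fits the sparsely annotated setting described above.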

Summary (original)

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework, that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods only learn from individual goal modalities, e.g. either language or goal images. However, existing large-scale imitation learning datasets are only partially labeled with language annotations, which prohibits current methods from learning language conditioned behavior from these datasets. MDT addresses this challenge by introducing a latent goal-conditioned state representation that is simultaneously trained on multimodal goal instructions. This state representation aligns image and language based goal embeddings and encodes sufficient information to predict future states. The representation is trained via two self-supervised auxiliary objectives, enhancing the performance of the presented transformer backbone. MDT shows exceptional performance on 164 tasks provided by the challenging CALVIN and LIBERO benchmarks, including a LIBERO version that contains less than $2\%$ language annotations. Furthermore, MDT establishes a new record on the CALVIN manipulation challenge, demonstrating an absolute performance improvement of $15\%$ over prior state-of-the-art methods that require large-scale pretraining and contain $10\times$ more learnable parameters. MDT shows its ability to solve long-horizon manipulation from sparsely annotated data in both simulated and real-world environments. Demonstrations and Code are available at https://intuitive-robots.github.io/mdt_policy/.

arXiv information

Authors	Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, Rudolf Lioutikov
Published	2024-07-08 14:46:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
