Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

Summary

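This work focuses on unsupervised vision-language-action mapping in the field of robotic manipulation.
Several approaches that use pre-trained large language and vision models have recently been proposed for this task.
However, they are computationally demanding and require careful fine-tuning of the outputs they produce.
A more lightweight alternative is to use multimodal variational autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation. For the state-of-the-art models, this has been demonstrated mostly on image-image or image-text data.
Here we examine whether, and how, multimodal VAEs can be used for unsupervised robotic manipulation tasks in a simulated environment.
Based on the results obtained, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
In addition, we systematically evaluate the challenges posed by the individual tasks, such as variability in object or robot position, the number of distractors, and the task length.
Our work thus also sheds light on the potential benefits and limitations of using current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.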

Abstract (original)

In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced outputs. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs) which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here we explore whether and how can multimodal VAEs be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models’ performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by the individual tasks such as object or robot position variability, number of distractors or the task length. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.

arXiv information

Authors	Gabriela Sejnova, Michal Vavrecka, Karla Stepanova
Published	2024-04-02 13:25:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
