LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

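Abstract

In recent years, instruction-tuned Large Multimodal Models (LMMs) have succeeded at several tasks, including image captioning and visual question answering.
However, leveraging these models remains an open problem for robotics.
Prior LMMs for robotics applications have been trained extensively on language and action data, but their ability to generalize across different settings has often fallen short of expectations.
To address this, we introduce LLARVA, a model trained with a novel instruction-tuning method that leverages structured prompts to unify a range of robot learning tasks, scenarios, and environments.
Additionally, we show that predicting intermediate 2-D representations, which we call "visual traces", helps further align the vision and action spaces for robot learning.
To pre-train the model, we generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset, and we evaluate on 12 different tasks in the RLBench simulator as well as on a physical Franka Emika Panda 7-DoF robot.
Our experiments yield strong performance, demonstrating that LLARVA, using 2-D and language representations, performs well compared with several contemporary baselines and can generalize across various robot environments and configurations.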

Abstract (original)

In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less than desired. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as ‘visual traces’, can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as a physical Franka Emika Panda 7-DoF robot. Our experiments yield strong performance, demonstrating that LLARVA – using 2-D and language representations – performs well compared to several contemporary baselines, and can generalize across various robot environments and configurations.
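To make the idea of a structured prompt paired with a 2-D visual trace more concrete, here is a minimal sketch. It is an illustration under stated assumptions and not the authors' implementation: the abstract does not specify the prompt template, so the field names (`robot`, `control_mode`, `task`, `num_steps`), the wording of the prompt, and the representation of a visual trace as a list of integer pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a visual trace is a sequence of 2-D pixel coordinates.
Point2D = Tuple[int, int]


@dataclass
class PromptFields:
    """Hypothetical fields that a structured prompt could unify across tasks."""
    robot: str          # e.g. the robot model
    control_mode: str   # e.g. how actions are expressed
    task: str           # natural-language task instruction
    num_steps: int      # how many future actions to request


def format_trace(trace: List[Point2D]) -> str:
    """Serialize a 2-D trace as text so it can be placed inside a prompt."""
    return "[" + ", ".join(f"({x}, {y})" for x, y in trace) + "]"


def build_prompt(fields: PromptFields, past_trace: List[Point2D]) -> str:
    """Fill one fixed template with the fields and the trace observed so far.

    The template wording is invented for illustration only.
    """
    return (
        f"You are a {fields.robot} robot using {fields.control_mode} control. "
        f'The task is "{fields.task}". '
        f"Previous 2-D visual trace: {format_trace(past_trace)}. "
        f"Predict the next {fields.num_steps} action(s) "
        f"and the future 2-D visual trace."
    )


if __name__ == "__main__":
    fields = PromptFields(
        robot="Franka Emika Panda",
        control_mode="joint position",
        task="open the drawer",
        num_steps=1,
    )
    print(build_prompt(fields, [(10, 20), (12, 24)]))
```

The point of the sketch is only that a single template can carry the robot, control mode, and task, so that data from different robots and environments shares one input format, while the trace gives the model a 2-D target expressed in image space.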

arXiv information

Authors	Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig
Published	2024-06-17 17:55:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
