LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

要約

ビジョン言語モデル（VLM）は最近、ロボットアクションを生成するために活用され、ビジョン言語アクション（VLA）モデルを形成しています。
ただし、特に限られた数のロボットデモンストレーションによって制約されている場合、ロボット制御のために前処理されたVLMを直接適合させることは依然として困難です。
この作業では、Llara：Lage Language and Robotics Assistantを紹介します。これは、ロボットアクションポリシーをVisuo-Textual会話として策定し、事前に処理されたVLMを強力なVLAに効率的に転送することを可能にします。
ビジョン。
まず、自動化されたパイプラインを提示して、既存の動作クローニングデータセットからロボットの会話スタイルの命令チューニングデータを生成し、ロボットアクションを画像ピクセルコーディネートに合わせます。
さらに、追加のアクションアノテーションを必要とせずに、6つの補助タスクを定義することにより、このデータセットを自己監視方法で強化します。
限られた量のそのようなデータセットでFinetunedを使用すると、ロボット制御のために意味のあるアクション決定が生じる可能性があることを示します。
複数のシミュレートされた現実世界のタスクにわたる実験を通じて、Llaraが大規模な言語モデルの一般化機能を維持しながら、最先端のパフォーマンスを達成することを実証します。
コード、データセット、および前処理されたモデルは、https：//github.com/lostxine/llaraで入手できます。

要約(オリジナル)

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

arxiv情報

著者	Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
発行日	2025-01-30 17:34:37+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー