LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

要約

広範な世界知識と強力な推論スキルを備えた大規模言語モデル (LLM) は、多くの場合、タスクを会話形式の命令と応答のペアとして扱うことで、領域を超えた多様なタスクに取り組むことができます。
この論文では、ロボットの動作ポリシーを会話として定式化し、ポリシー学習を補完する補助データを使用してトレーニングすると応答が向上するフレームワークである LLaRA: Large Language and Robotics Assistant を提案します。
視覚的な入力を備えた LLM、つまりビジョン言語モデル (VLM) は、状態情報を視覚的テキストプロンプトとして処理し、最適なポリシー決定をテキストで生成する機能を備えています。
このようなアクションポリシー VLM をトレーニングするには、まず、既存の動作クローンデータからさまざまな高品質のロボット工学指示データを生成する自動パイプラインを導入します。
ロボットタスクに合わせて調整された会話形式の定式化に基づいて、結果として得られるデータセットのコレクションで微調整された VLM は、意味のあるロボット動作ポリシーの決定を生成できます。
複数のシミュレートされた現実世界の環境にわたる私たちの実験は、提案された LLaRA フレームワークの最先端のパフォーマンスを実証します。
コード、データセット、および事前トレーニングされたモデルは、https://github.com/LostXine/LLaRA で入手できます。

要約(オリジナル)

Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

arxiv情報

著者	Xiang Li,Cristina Mata,Jongwoo Park,Kumara Kahatapitiya,Yoo Sung Jang,Jinghuan Shang,Kanchana Ranasinghe,Ryan Burgert,Mu Cai,Yong Jae Lee,Michael S. Ryoo
発行日	2024-06-28 17:59:12+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー