Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

要約

人間は、幅広いタスクを達成するために言語と視覚を理解することに優れています。
対照的に、一般的な指示に従う具体化されたエージェントを作成することは、依然として困難な課題です。
純粋な言語のみのモデルを使用する以前の研究では、視覚的な根拠が欠けているため、言語の指示を視覚的な観察と結びつけることが困難です。
一方、事前にトレーニングされた視覚言語モデルを使用する方法では、通常、言語と視覚表現が分割されており、それらを融合させるために特殊なネットワークアーキテクチャを設計する必要があります。
ロボットが視覚ベースの環境で命令に従うタスクを解決するためのシンプルで効果的なモデルを提案します。
私たちの \ours メソッドは、視覚的な観察と言語の指示をエンコードするマルチモーダルトランスフォーマーと、エンコードされた表現に基づいてアクションを予測するポリシートランスフォーマーで構成されます。
マルチモーダルトランスフォーマーは、何百万もの画像とテキストのペアと自然言語テキストで事前にトレーニングされているため、観察と指示の一般的なクロスモーダル表現が生成されます。
ポリシートランスフォーマーは、観測とアクションの完全な履歴を追跡し、アクションを自己回帰的に予測します。
この統合されたトランスフォーマーモデルは、シングルタスク設定とマルチタスク設定の両方で、最先端の事前トレーニング済みまたはゼロからトレーニング済みのすべての方法よりも優れていることを示しています。
私たちのモデルは、以前の研究よりも優れたモデルのスケーラビリティと一般化能力も示しています。

要約(オリジナル)

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lack visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring designing specialized network architecture to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our \ours method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better model scalability and generalization ability than prior work.

arxiv情報

著者	Hao Liu,Lisa Lee,Kimin Lee,Pieter Abbeel
発行日	2022-10-24 17:46:47+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー