OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

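Summary

This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft.
Prior work either emits textual goals to a separate controller or generates control commands directly; OmniJARVIS instead takes a different path, aiming for both strong reasoning and efficient decision-making through unified tokenization of multimodal interaction data.
First, we introduce a self-supervised approach that learns a behavior encoder, which produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$, and an imitation learning policy decoder conditioned on these tokens.
These additional behavior tokens are added to the vocabulary of a pretrained multimodal language model.
With this encoder, we pack long-term multimodal interactions (task instructions, memories, thoughts, observations, textual responses, behavior trajectories, and so on) into unified token sequences and model them with an autoregressive transformer.
Because the behavior tokens are semantically meaningful, the resulting VLA model, OmniJARVIS, can reason (by producing chains of thought), plan, answer questions, and act (by producing behavior tokens for the imitation learning policy decoder).
OmniJARVIS performs strongly on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft.
Our analysis further reveals the key design principles in interaction data formation and unified tokenization, as well as the approach's scaling potential.
The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS.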
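
The vocabulary extension and sequence packing described in the summary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy text vocabulary, the `<behavior_i>` token names, the segment markers, and the helper functions `extend_vocab` and `pack_interaction` are hypothetical and are not taken from the OmniJARVIS code or paper.

```python
from typing import Dict, List, Tuple


def extend_vocab(vocab: Dict[str, int], num_behavior_tokens: int) -> Dict[str, int]:
    """Return a copy of vocab with new behavior tokens appended after the existing ids."""
    extended = dict(vocab)
    next_id = max(extended.values()) + 1 if extended else 0
    for i in range(num_behavior_tokens):
        # Hypothetical naming scheme for the discretized behavior tokens.
        extended[f"<behavior_{i}>"] = next_id + i
    return extended


def pack_interaction(
    segments: List[Tuple[str, List[str]]], vocab: Dict[str, int]
) -> List[int]:
    """Flatten (kind, tokens) segments into one id sequence for an autoregressive model."""
    ids: List[int] = []
    for kind, tokens in segments:
        ids.append(vocab[f"<{kind}>"])  # marker announcing the segment type
        ids.extend(vocab[t] for t in tokens)
    return ids


# Toy text vocabulary standing in for a pretrained model's vocabulary (assumption).
text_vocab = {
    tok: i
    for i, tok in enumerate(
        ["<instruction>", "<thought>", "<behavior>", "mine", "wood", "find", "tree"]
    )
}
vocab = extend_vocab(text_vocab, num_behavior_tokens=4)

# One interaction: an instruction, a thought, then behavior tokens that a
# behavior encoder would have produced for a trajectory segment (made up here).
segments = [
    ("instruction", ["mine", "wood"]),
    ("thought", ["find", "tree"]),
    ("behavior", ["<behavior_2>", "<behavior_0>"]),
]
packed = pack_interaction(segments, vocab)
print(packed)
```

In this sketch, text and behavior tokens share a single id space, so one autoregressive model can be trained on the packed sequence. Turning predicted behavior tokens back into low-level actions would be the job of the policy decoder, which is not modeled here.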

Summary (original)

This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the imitation learning policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS.

arXiv information

Authors	Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
Published	2024-10-31 14:27:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
