JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

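Summary

Achieving human-like planning and control from multimodal observations in an open world is a key milestone toward more capable generalist agents.
Existing approaches can handle certain long-horizon tasks in an open world.
However, they still struggle when the number of open-world tasks is potentially unbounded, and they lack the ability to progressively improve task completion as game time passes.
JARVIS-1 is an open-world agent that perceives multimodal input (visual observations and human instructions), generates sophisticated plans, and performs embodied control, all within the popular yet challenging open-world Minecraft universe.
Specifically, JARVIS-1 is built on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans.
The plans are ultimately dispatched to goal-conditioned controllers.
JARVIS-1 is equipped with a multimodal memory, which supports planning with both pre-trained knowledge and the agent's actual in-game survival experience.
In the experiments, JARVIS-1 shows nearly perfect performance across more than 200 varied tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate level.
JARVIS-1 achieves a completion rate of 12.5% on the long-horizon diamond pickaxe task.
This is an increase of up to 5 times over previous records.
Furthermore, the authors show that, thanks to its multimodal memory, JARVIS-1 can $\textit{self-improve}$ under a life-long learning paradigm, leading to more general intelligence and improved autonomy.
The project page is available at https://craftjarvis-jarvis1.github.io.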

Abstract (original)

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to $\textit{self-improve}$ following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy. The project page is available at https://craftjarvis-jarvis1.github.io.
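
The loop the abstract describes (retrieve relevant experience from memory, produce a plan, dispatch sub-goals to a controller, store what worked) can be illustrated with a small toy. This is only a sketch under stated assumptions and not the JARVIS-1 implementation: the names Experience, MultimodalMemory, toy_planner, toy_controller, run_task and PRIOR_PLANS are all hypothetical, a set of string tags stands in for visual observations, a lookup table stands in for the pre-trained multimodal language model, and the example plan is purely illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    """One stored success: the task, the situation, and the plan that worked."""
    task: str
    observation: set[str]  # assumption: string tags stand in for visual features
    plan: list[str]


@dataclass
class MultimodalMemory:
    """Hypothetical memory keyed on both the instruction and the observation."""
    entries: list[Experience] = field(default_factory=list)

    def add(self, experience: Experience) -> None:
        self.entries.append(experience)

    def retrieve(self, task: str, observation: set[str], k: int = 1) -> list[Experience]:
        def score(e: Experience) -> float:
            task_match = 1.0 if e.task == task else 0.0
            union = e.observation | observation
            overlap = len(e.observation & observation) / len(union) if union else 0.0
            return task_match + overlap  # exact task match plus Jaccard overlap

        return sorted(self.entries, key=score, reverse=True)[:k]


# Assumption: a lookup table plays the role of pre-trained knowledge.
PRIOR_PLANS = {
    "craft wooden pickaxe": [
        "collect logs",
        "craft planks",
        "craft sticks",
        "craft crafting table",
        "craft wooden pickaxe",
    ],
}


def toy_planner(task: str, observation: set[str], retrieved: list[Experience]) -> list[str]:
    """Stand-in for the language model: reuse a remembered plan, else fall back to the prior."""
    for exp in retrieved:
        if exp.task == task:
            return list(exp.plan)
    return list(PRIOR_PLANS.get(task, []))


def toy_controller(sub_goal: str) -> bool:
    """Stand-in for a goal-conditioned controller; here every sub-goal succeeds."""
    return True


def run_task(task, observation, memory, planner=toy_planner, controller=toy_controller) -> bool:
    retrieved = memory.retrieve(task, observation)
    plan = planner(task, observation, retrieved)
    if not plan:
        return False  # nothing known about this task
    for sub_goal in plan:
        if not controller(sub_goal):
            return False  # failed episodes are not stored in this sketch
    memory.add(Experience(task, set(observation), plan))
    return True
```

In this toy, the first successful run of a task writes its plan into memory, and later runs of the same task retrieve that plan instead of falling back to the prior table, which is the sense in which memory lets the agent improve with experience.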

arXiv information

Authors	Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
Published	2023-11-10 11:17:58+00:00

Source and services used

arxiv.jp, Google
