JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Summary

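Action-based decision-making in open-world environments has recently attracted significant attention.
Visual Language Action (VLA) models pretrained on large-scale web datasets have shown promise on decision-making tasks.
However, prior work has focused mainly on action post-training and has often neglected improvements to the foundation model itself.
In response, we introduce a new approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner.
This enhancement improves the models' world knowledge, visual recognition, and spatial grounding in open-world environments.
Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on more than 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing.
Our experiments show that post-training on non-trajectory tasks yields a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks.
Our approach also surpasses traditional imitation-learning-based policies in Minecraft, achieving state-of-the-art performance.
To foster further research, we have open-sourced the code, models, and datasets.
The project page is at https://craftjarvis.github.io/JarvisVLA.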

Abstract (original)

Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models’ capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found in https://craftjarvis.github.io/JarvisVLA.

arXiv information

Authors	Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Published	2025-03-20 17:21:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
