3D-VLA: A 3D Vision-Language-Action Generative World Model

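Summary

Recent vision-language-action (VLA) models rely on 2D inputs and are poorly integrated with the broader 3D physical world.
They also predict actions by learning a direct mapping from perception to action, ignoring the dynamics of the world and how actions relate to those dynamics.
Humans, by contrast, have world models that let them imagine future scenarios and plan their actions accordingly.
To close this gap, we propose 3D-VLA, a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model.
Specifically, 3D-VLA is built on a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment.
To give the model generation abilities, we train a series of embodied diffusion models and align them with the LLM to predict goal images and point clouds.
To train 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting extensive 3D-related information from existing robotics datasets.
Our experiments on held-in datasets show that 3D-VLA significantly improves reasoning, multimodal generation, and planning capabilities in embodied environments, indicating its potential for real-world applications.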

Abstract (original)

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

arXiv information

Authors	Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
Published	2024-03-14 17:58:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google