A Survey on Vision-Language-Action Models for Embodied AI

Abstract

Embodied AI is widely recognized as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models, referred to as vision-language-action models (VLAs), has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. In recent years, a myriad of VLAs have been developed, making it imperative to capture the rapidly evolving landscape through a comprehensive survey. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges faced by VLAs and outline promising future directions in embodied AI.

arXiv information

Authors	Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King
Published	2025-03-03 03:19:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
