A Survey on Vision-Language-Action Models for Embodied AI

Abstract

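Embodied AI is widely regarded as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world.
Building on the success of large language models and vision-language models, a new category of multimodal models, called vision-language-action models (VLAs), has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions.
In recent years, a large number of VLAs have been developed, making a comprehensive survey essential for capturing this rapidly evolving landscape.
To this end, we present the first survey of VLAs for embodied AI.
This work provides a detailed taxonomy of VLAs, organized into three major lines of research.
The first line focuses on the individual components of VLAs.
The second line is devoted to developing control policies that are adept at predicting low-level actions.
The third line comprises high-level task planners that can decompose long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions.
In addition, we provide an extensive summary of related resources, including datasets, simulators, and benchmarks.
Finally, we discuss the challenges VLAs face and outline promising future directions for embodied AI.
A project associated with this survey is available at https://github.com/yueen-ma/Awesome-VLA.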

Abstract (original)

Embodied AI is widely recognized as a key element of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models — referred to as vision-language-action models (VLAs) — has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. In recent years, a myriad of VLAs have been developed, making it imperative to capture the rapidly evolving landscape through a comprehensive survey. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges faced by VLAs and outline promising future directions in embodied AI. We have created a project associated with this survey, which is available at https://github.com/yueen-ma/Awesome-VLA.

arXiv information

Authors	Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King
Published	2025-03-04 08:24:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
