From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

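Summary

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is that they can leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies.
However, current evaluations of VLAs remain insufficient.
Traditional imitation learning benchmarks are unsuitable because they lack language instructions.
Emerging VLA benchmarks that do incorporate language often offer only a limited set of evaluation tasks and are not designed to investigate how much VLM pretraining contributes to the generalization capabilities of the downstream robot policy.
Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates barriers to reproducibility and accessibility.
To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects.
We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability.
Our results show that VLM backbones endow VLAs with robust perceptual understanding and high-level planning, which we refer to as good intentions, but that this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions but falter in action execution.
Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities.
We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap.
More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/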

Summary (original)

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, ‘generalist’ robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM’s generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at https://ai4ce.github.io/INT-ACT/

arXiv information

Authors	Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng
Published	2025-06-11 16:52:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
