Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

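Summary

The ability of vision-language-action (VLA) models to generalize to unseen tasks is crucial for achieving general-purpose robotic manipulation in open-world settings.
However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored.
To address this gap, we introduce AGNOSTOS, a new simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation.
AGNOSTOS consists of 23 unseen manipulation tasks that are distinct from common training task distributions, and it incorporates two levels of generalization difficulty to assess robustness.
Our systematic evaluation shows that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks.
To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks.
In addition, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics.
On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs.
We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.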
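
The idea of choosing relevant seen-task demonstrations and placing them in an LLM prompt can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Demo` record, the hand-written feature vectors standing in for "dynamics" features, the cosine-similarity ranking, and the plain-text prompt layout are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    task: str             # name of a seen task (hypothetical)
    feature: List[float]  # stand-in for a learned dynamics feature (assumption)
    actions: str          # action sequence serialized as text (assumption)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_demos(query_feature: List[float], pool: List[Demo], k: int = 2) -> List[Demo]:
    """Return the k demos whose features are most similar to the query."""
    ranked = sorted(pool, key=lambda d: cosine(query_feature, d.feature), reverse=True)
    return ranked[:k]


def build_prompt(demos: List[Demo], instruction: str) -> str:
    """Lay out the selected demos as in-context examples, then the unseen task."""
    parts = [f"Task: {d.task}\nActions: {d.actions}" for d in demos]
    parts.append(f"Task: {instruction}\nActions:")
    return "\n\n".join(parts)


# Toy pool of seen-task demonstrations with made-up features.
pool = [
    Demo("open drawer", [1.0, 0.0, 0.2], "reach; grasp; pull"),
    Demo("push button", [0.0, 1.0, 0.0], "reach; press"),
    Demo("close drawer", [0.9, 0.1, 0.3], "reach; push"),
]
chosen = select_demos([1.0, 0.0, 0.25], pool, k=2)
prompt = build_prompt(chosen, "open cupboard")
print(prompt)
```

In this toy setup the two drawer demonstrations rank above the button demonstration, and the resulting prompt ends with an open `Actions:` slot for the unseen task, which is where an LLM would be asked to continue.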

Summary (original)

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.

arXiv information

Authors	Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifang Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang
Published	2025-05-21 15:35:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
