Improving Vision-Language-Action Models via Chain-of-Affordance

Abstract

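Robot foundation models, particularly Vision-Language-Action (VLA) models, have attracted significant attention for their ability to strengthen robot policy learning and to greatly improve robot generalization and robustness.
OpenAI's recent model, o1, has shown an impressive ability to solve complex problems by using extensive reasoning chains.
This raises an important question: can robot models perform better in complex, multi-task environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction?
In this paper, we introduce \textbf{Chain-of-Affordance (CoA)}, a novel approach to scaling robot models that incorporates reasoning in the form of sequential robot affordances to facilitate task completion.
Specifically, we prompt the model to consider the following four types of affordances before taking action: a) object affordance – what object to manipulate and where it is;
b) grasp affordance – the specific object part to grasp;
c) spatial affordance – the optimal space in which to place the object; and
d) movement affordance – the collision-free path for movement.
By integrating this knowledge into the policy model, the robot gains essential context, allowing it to act with greater precision and robustness during inference.
Our experiments show that CoA outperforms state-of-the-art robot foundation models such as OpenVLA and Octo.
In addition, CoA generalizes strongly to unseen object poses, identifies free space, and avoids obstacles in novel environments.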
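
The abstract names four affordances but does not say how they are represented. As a rough illustration only, the sketch below assumes each affordance can be written as a short text line built from 2D coordinates, emitted in the a) to d) order before an action is predicted. The class `AffordanceChain`, its field names, and the output format are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical: a normalized 2D image coordinate (x, y).
Point = Tuple[float, float]


def _fmt(p: Point) -> str:
    """Format a point with two decimals, e.g. (0.40, 0.25)."""
    return f"({p[0]:.2f}, {p[1]:.2f})"


@dataclass
class AffordanceChain:
    """Illustrative container for the four affordances in the abstract.

    All field names are assumptions made for this sketch.
    """
    object_name: str         # object affordance: what to manipulate
    object_location: Point   # object affordance: where it is
    grasp_point: Point       # grasp affordance: the part to grasp
    placement_point: Point   # spatial affordance: where to place the object
    waypoints: List[Point]   # movement affordance: collision-free path

    def to_reasoning_text(self) -> str:
        """Serialize the affordances in the a) to d) order of the abstract."""
        path = " -> ".join(_fmt(p) for p in self.waypoints)
        lines = [
            f"object: {self.object_name} at {_fmt(self.object_location)}",
            f"grasp: {_fmt(self.grasp_point)}",
            f"place: {_fmt(self.placement_point)}",
            f"move: {path}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    chain = AffordanceChain(
        object_name="mug",
        object_location=(0.40, 0.25),
        grasp_point=(0.43, 0.25),
        placement_point=(0.70, 0.60),
        waypoints=[(0.43, 0.25), (0.55, 0.40), (0.70, 0.60)],
    )
    # In this sketch, the text would be placed ahead of the action
    # prediction step as intermediate reasoning.
    print(chain.to_reasoning_text())
```

Keeping the four entries in a fixed order mirrors the "sequential" framing in the abstract; whether the actual method uses text, coordinates, or another encoding is not stated here.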

Abstract (original)

Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robot generalization and robustness. OpenAI's recent model, o1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce \textbf{Chain-of-Affordance (CoA)}, a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: a) object affordance – what object to manipulate and where it is; b) grasp affordance – the specific object part to grasp; c) spatial affordance – the optimal space to place the object; and d) movement affordance – the collision-free path for movement. By integrating this knowledge into the policy model, the robot gains essential context, allowing it to act with increased precision and robustness during inference. Our experiments demonstrate that CoA achieves performance superior to that of state-of-the-art robot foundation models, such as OpenVLA and Octo. Additionally, CoA shows strong generalization to unseen object poses, identifies free space, and avoids obstacles in novel environments.

arXiv information

Authors	Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Feifei Feng
Published	2024-12-29 12:24:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
