Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Summary

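Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills.
Vision-language models (VLMs) pretrained on Internet data could, in principle, offer a framework for tackling such problems.
In their current form, however, VLMs lack both the nuanced understanding of intricate physics that robotic manipulation requires and the ability to reason over long horizons to address compounding errors.
This paper introduces a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
At its core, the approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning.
Experimental results show that the method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS).
Videos are available at https://reflect-vlm.github.io.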
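
The propose, imagine, and reflect steps described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function `reflective_step`, its parameters, and the toy stand-ins below are all hypothetical names introduced here. In the paper a VLM proposes and revises actions and a generative model predicts future states; in this sketch plain Python functions on integer states take their place.

```python
from typing import Callable

State = int   # toy stand-in for an observation of the world
Action = int  # toy stand-in for a motor skill


def reflective_step(
    state: State,
    propose: Callable[[State], Action],
    imagine: Callable[[State, Action], State],
    reflect: Callable[[State, Action, State], Action],
    horizon: int = 3,
) -> Action:
    """Propose an action, imagine its future, then reflect and possibly revise."""
    action = propose(state)
    # Imagine the state after the proposed action, then keep rolling forward
    # with further proposals until `horizon` imagined steps have been taken.
    future = imagine(state, action)
    for _ in range(horizon - 1):
        future = imagine(future, propose(future))
    # Reflection sees the current state, the proposal, and the imagined outcome.
    return reflect(state, action, future)


# Hypothetical toy problem: reach GOAL without overshooting it.
GOAL = 10


def toy_propose(state: State) -> Action:
    return 3  # a greedy proposer that always takes the largest step


def toy_imagine(state: State, action: Action) -> State:
    return state + action  # a perfect toy dynamics model


def toy_reflect(state: State, action: Action, future: State) -> Action:
    # Keep the proposal unless the imagined future overshoots the goal.
    return action if future <= GOAL else 1


if __name__ == "__main__":
    print(reflective_step(0, toy_propose, toy_imagine, toy_reflect))
    print(reflective_step(3, toy_propose, toy_imagine, toy_reflect))
```

In this toy setting the greedy proposal is kept while the imagined rollout stays at or below the goal and is replaced by a smaller step once the rollout overshoots, which mirrors the idea of using predicted futures to catch a suboptimal choice before it is executed.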

Summary (original)

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs’ physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a ‘reflection’ mechanism – it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.

arXiv information

Authors	Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
Published	2025-02-23 20:42:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
