From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models

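Summary

We aim to learn to solve long-horizon decision-making problems in complex robotics domains, given low-level skills and a small number of short-horizon demonstrations consisting of image sequences.
To this end, we focus on learning abstract symbolic world models that enable zero-shot generalization to new goals through planning.
A key component of such a model is the set of symbolic predicates that define properties of objects and relations between them.
In this work, we use pretrained vision-language models (VLMs) both to propose a large set of visual predicates potentially relevant to decision-making and to evaluate those predicates directly from camera images.
At training time, the proposed predicates and the demonstrations are passed to an optimization-based model-learning algorithm, which yields an abstract symbolic world model defined over a compact subset of the proposed predicates.
At test time, given a new goal in a new setting, the VLM constructs a symbolic description of the current world state, and a search-based planning algorithm finds a sequence of low-level skills that achieves the goal.
Experiments in both simulation and the real world show that the method generalizes aggressively: the learned world model solves problems with a wide variety of object types, arrangements, object counts, and visual backgrounds, as well as new goals and horizons much longer than those seen during training.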

Summary (original)

Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
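The test-time step described above, searching over symbolic states for a sequence of skills that reaches a goal, can be illustrated with a minimal sketch. It assumes STRIPS-style operators and plain breadth-first search; the abstract does not specify the operator representation or the search algorithm, so both are stand-ins. The predicate names (`On`, `Holding`, `Inside`, `HandEmpty`) and skill names (`Pick`, `PlaceIn`) are hypothetical, and the initial state is written by hand where the described system would obtain it from a VLM.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional

Atom = str  # a ground predicate written as text, e.g. "On(cup, table)"


class Operator(NamedTuple):
    """A ground STRIPS-style operator abstracting one low-level skill (assumed form)."""
    name: str
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    delete_effects: FrozenSet[Atom]


def plan(init: FrozenSet[Atom], goal: FrozenSet[Atom],
         operators: List[Operator]) -> Optional[List[str]]:
    """Breadth-first search over abstract states; returns skill names or None."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:  # every goal atom holds in this state
            return steps
        for op in operators:
            if op.preconditions <= state:  # operator is applicable
                nxt = (state - op.delete_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, steps + [op.name]))
    return None  # goal unreachable with the given operators


# Hypothetical toy domain; a VLM would supply `init` in the described system.
init = frozenset({"HandEmpty()", "On(cup, table)"})
goal = frozenset({"Inside(cup, bin)"})
operators = [
    Operator("Pick(cup)",
             preconditions=frozenset({"HandEmpty()", "On(cup, table)"}),
             add_effects=frozenset({"Holding(cup)"}),
             delete_effects=frozenset({"HandEmpty()", "On(cup, table)"})),
    Operator("PlaceIn(cup, bin)",
             preconditions=frozenset({"Holding(cup)"}),
             add_effects=frozenset({"Inside(cup, bin)", "HandEmpty()"}),
             delete_effects=frozenset({"Holding(cup)"})),
]

result = plan(init, goal, operators)
print(result)
```

Because the planner only sees sets of atoms, the same operators apply unchanged to a scene with different objects or a different goal, which is the kind of generalization the abstract attributes to symbolic world models.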

arXiv information

Authors	Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Published	2025-06-10 03:08:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
