Predicate Invention from Pixels via Pretrained Vision-Language Models

Abstract

Our aim is to learn to solve long-horizon decision-making problems in highly variable, combinatorially complex robotics domains given raw sensor input in the form of images. Previous work has shown that one way to achieve this aim is to learn a structured abstract transition model in the form of symbolic predicates and operators, and then plan within this model to solve novel tasks at test time. However, these learned models do not ground directly into pixels from just a handful of demonstrations. In this work, we propose to invent predicates that operate directly over input images by leveraging the capabilities of pretrained vision-language models (VLMs). Our key idea is that, given a set of demonstrations, a VLM can be used to propose a set of predicates that are potentially relevant for decision-making and then to determine the truth values of these predicates in both the given demonstrations and new image inputs. We build upon an existing framework for predicate invention, which generates feature-based predicates operating on object-centric states, to also generate visual predicates that operate on images. Experimentally, we show that our approach, pix2pred, is able to invent semantically meaningful predicates that enable generalization to novel, complex, and long-horizon tasks across two simulated robotic environments.
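The two roles the abstract assigns to the VLM (proposing candidate predicates from demonstrations, then deciding their truth values on images) can be illustrated with a minimal Python sketch. This is not the pix2pred implementation: every name here (`VisualPredicate`, `propose_predicates`, `evaluate_predicates`, `toy_vlm`, the prompt strings) is hypothetical, images are represented by string identifiers, and the VLM is replaced by a hard-coded stub so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Sequence, Set

# Assumption: a VLM is modeled as a function (image, prompt) -> text reply.
VLMQuery = Callable[[str, str], str]

# Hypothetical prompt strings, not taken from the paper.
PROPOSE_PROMPT = "List predicates relevant to the task, one per line as 'Name: yes/no question'."
EVAL_PREFIX = "Answer yes or no: "


@dataclass(frozen=True)
class VisualPredicate:
    name: str      # symbolic name usable by a planner, e.g. "DoorOpen"
    question: str  # natural-language query used to ground the predicate in an image


def propose_predicates(vlm: VLMQuery, demo_images: Sequence[str]) -> List[VisualPredicate]:
    """Collect candidate predicates proposed by the VLM over all demonstration images."""
    seen: Dict[str, VisualPredicate] = {}
    for image in demo_images:
        reply = vlm(image, PROPOSE_PROMPT)
        for line in reply.splitlines():
            if ":" not in line:
                continue  # skip lines that do not follow the 'Name: question' format
            name, question = line.split(":", 1)
            name = name.strip()
            if name and name not in seen:
                seen[name] = VisualPredicate(name, question.strip())
    return list(seen.values())


def evaluate_predicates(
    vlm: VLMQuery, image: str, predicates: Sequence[VisualPredicate]
) -> FrozenSet[str]:
    """Return the names of the predicates the VLM judges to be true in the image."""
    true_names: Set[str] = set()
    for pred in predicates:
        answer = vlm(image, EVAL_PREFIX + pred.question)
        if answer.strip().lower().startswith("yes"):
            true_names.add(pred.name)
    return frozenset(true_names)


# Toy ground truth standing in for what a real VLM would see in each image.
TOY_FACTS: Dict[str, Set[str]] = {
    "img_a": {"Is the door open?"},
    "img_b": {"Is the door open?", "Is the cup on the table?"},
}


def toy_vlm(image: str, prompt: str) -> str:
    """Hard-coded stub that mimics the two kinds of VLM queries used above."""
    if prompt == PROPOSE_PROMPT:
        return "DoorOpen: Is the door open?\nCupOnTable: Is the cup on the table?"
    question = prompt[len(EVAL_PREFIX):]
    return "yes" if question in TOY_FACTS.get(image, set()) else "no"
```

Calling `propose_predicates(toy_vlm, ["img_a", "img_b"])` and then `evaluate_predicates` on each image yields a set of true predicate names per image, which is the kind of abstract state a symbolic planner could consume. How the actual approach selects among proposed predicates and learns operators is not described in the abstract and is not modeled here.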

arXiv information

Authors	Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Published	2024-12-31 06:14:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
