Visual Planning: Let’s Think Only with Images

Abstract

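Recent advances in large language models (LLMs) and their multimodal extensions (MLLMs) have substantially improved machine reasoning across diverse tasks.
However, these models rely predominantly on pure text as the medium for both expressing and structuring reasoning, even when visual information is present.
In this work, we argue that language is not always the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometric information.
Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text.
In this paradigm, planning is carried out through sequences of images that encode step-by-step inference in the visual domain, much as humans sketch or visualize future actions.
We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), which uses GRPO to post-train large vision models and yields substantial improvements in planning on a selection of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior.
Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space.
Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.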

Abstract (original)

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
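The abstract names GRPO as the optimizer behind VPRL but does not describe the reward design. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: several candidates are sampled for the same input, and each candidate's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sampling group.

    `rewards` holds one scalar per candidate sampled for the same input.
    `eps` is an assumed small constant that avoids division by zero when
    every candidate in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate next-step images generated from
# the same current state; the values are illustrative, not from the paper.
rewards = [1.0, 0.0, 0.0, -1.0]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's average receive a positive advantage and those below receive a negative one, so no separately learned value function is needed for this step.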

arXiv information

Authors	Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Published	2025-05-16 16:17:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
