VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

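Abstract

Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene depending on their goals.
Recent advances in large multimodal models (LMMs) make it possible to express such intentions flexibly in natural language, so that a query can guide the visual reasoning process.
Frameworks such as Visual Chain-of-Thought have demonstrated the benefit of an explicit reasoning step in which the model predicts a focus region before answering the query.
However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability because of the combinatorial explosion of intention-region pairs.
To overcome this limitation, we propose VisRL, the first framework to apply reinforcement learning (RL) to the problem of intention-driven visual perception.
VisRL optimizes the entire visual reasoning process using only reward signals.
By treating intermediate focus selection as an internal decision optimized through trial and error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world.
Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
Our code is available at this [URL](https://github.com/zhangquanchen/VisRL).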

Abstract (original)

Visual understanding is inherently intention-driven – humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at this [URL](https://github.com/zhangquanchen/VisRL).

arXiv information

Authors	Zhangquan Chen, Xufang Luo, Dongsheng Li
Published	2025-03-10 16:49:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
