G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

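Summary

Vision-Language Models (VLMs) excel in many direct multimodal tasks, but they struggle to translate this ability into effective decision-making within interactive, visually rich environments such as games.
This "knowing-doing" gap significantly limits their potential as autonomous agents.
To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment that features diverse visual games with unified interfaces and adjustable, compositional difficulty, and that is specifically designed for scalable multi-game parallel training.
Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution; these models demonstrate emergent perception and reasoning patterns.
To further mitigate the challenges arising from game diversity, we develop G1 models.
G1 incorporates a perception-enhanced cold start prior to RL fine-tuning.
The resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models such as Claude-3.7-Sonnet-Thinking.
Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process.
The source code, including VLM-Gym and the RL training code, is released at https://github.com/chenllliang/g1 to foster future research on advancing VLMs as capable interactive agents.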

Abstract (original)

Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This “knowing-doing” gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code, including VLM-Gym and the RL training code, is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.
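
As a rough illustration of what "unified interfaces" and "adjustable, compositional difficulty" could look like in practice, the sketch below defines one interface shared by every game and a toy game with two independent difficulty knobs. It is not taken from the VLM-Gym source: `VisualGame`, `CountingGame`, `rollout`, `oracle_policy`, and both difficulty parameters are hypothetical names chosen for this example, a text string stands in for the image observation, and the games are run sequentially rather than in parallel.

```python
import random
from dataclasses import dataclass
from typing import Callable, Protocol


class VisualGame(Protocol):
    """Hypothetical unified interface: every game exposes the same two calls."""

    def reset(self, seed: int) -> str: ...

    def step(self, action: str) -> tuple[str, float, bool]: ...


@dataclass
class CountingGame:
    """Toy stand-in for a visual game; a text observation replaces the image."""

    num_objects: int = 3  # hypothetical difficulty knob 1
    num_colors: int = 2   # hypothetical difficulty knob 2; the knobs compose
    _answer: int = 0

    def reset(self, seed: int) -> str:
        rng = random.Random(seed)
        colors = [rng.randrange(self.num_colors) for _ in range(self.num_objects)]
        self._answer = colors.count(0)  # task: count the objects of color 0
        return " ".join(f"c{c}" for c in colors)

    def step(self, action: str) -> tuple[str, float, bool]:
        reward = 1.0 if action.strip() == str(self._answer) else 0.0
        return "", reward, True  # single-turn episode, so it ends immediately


def rollout(
    games: dict[str, VisualGame], policy: Callable[[str], str], seed: int
) -> dict[str, float]:
    """Run one episode per game through the shared interface; collect rewards."""
    rewards = {}
    for name, game in games.items():
        observation = game.reset(seed)
        _, reward, _ = game.step(policy(observation))
        rewards[name] = reward
    return rewards


def oracle_policy(observation: str) -> str:
    """Perfect 'perception' for the toy game: count the tokens of color 0."""
    return str(observation.split().count("c0"))
```

Because the rollout loop only touches `reset` and `step`, a harder variant such as `CountingGame(num_objects=10, num_colors=5)` or an entirely different game can be added to the `games` dictionary without changing the training-side code, which is the property a unified interface is meant to provide.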

arXiv information

Authors	Liang Chen,Hongcheng Gao,Tianyu Liu,Zhiqi Huang,Flood Sung,Xinyu Zhou,Yuxin Wu,Baobao Chang
Published	2025-05-19 17:54:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
