Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

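Summary

Inspired by the remarkable reasoning ability of DeepSeek-R1 on complex textual tasks, many studies try to elicit similar capabilities in Multimodal Large Language Models (MLLMs) by applying reinforcement learning (RL) directly.
However, these models still struggle to activate complex reasoning.
Rather than examining multimodal RL in isolation, this paper digs into current training pipelines and identifies three key phenomena: 1) Effective cold-start initialization is critical for strengthening MLLM reasoning.
Intriguingly, initializing with carefully selected text-only data can by itself yield performance that surpasses many recent multimodal reasoning models, even before any multimodal RL is applied.
2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance.
3) A subsequent text-only RL phase, run after the multimodal RL phase, further improves multimodal reasoning.
This staged training approach effectively balances perceptual grounding with the development of cognitive reasoning.
By incorporating these insights and addressing the problems of multimodal RL, the authors introduce ReVisual-R1, which achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and the difficult AIME2024 and AIME2025.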
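
The summary does not say how gradient stagnation arises under GRPO. One commonly cited cause in GRPO-style training is that the group-normalized advantage collapses to zero whenever every sampled response in a group receives the same reward, so that group contributes no policy gradient. The sketch below illustrates only that general effect; it is an assumption that this matches what the authors mean, and the function name `group_relative_advantages` and the `eps` value are hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Illustrative only: this mirrors the usual GRPO-style normalization,
    not necessarily the exact formulation used in the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages (a learning signal).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every response gets the same reward yield all-zero advantages,
# so they contribute nothing to the policy gradient.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
print("all right:", all_right)
```

If many prompts in a batch are either too hard or too easy for the current policy, most groups fall into the last two cases, and the effective gradient signal shrinks accordingly.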

Abstract (original)

Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.

arXiv information

Authors	Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
Published	2025-06-04 17:51:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
