Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

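Abstract

Improving multi-modal large language models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL).
However, these supervised methods require expensive, manually annotated multi-modal data, which is ultimately an unsustainable resource.
Recent work has explored unsupervised post-training, but those methods are complex and difficult to iterate on.
This work is the first to investigate GRPO, a stable and scalable online RL algorithm, as a means of enabling continual self-improvement without external supervision.
We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs.
MM-UPT builds on GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses.
Our experiments show that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% $\rightarrow$ 72.9% on MathVista and 62.9% $\rightarrow$ 68.7% on We-Math), using a standard dataset without ground-truth labels.
MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO.
Furthermore, we show that incorporating synthetic questions generated solely by the MLLM itself also boosts performance, highlighting a promising approach for scalable self-improvement.
Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision.
Our code is available at https://github.com/waltonfuture/MM-UPT.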

Abstract (original)

Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data, an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% $\rightarrow$ 72.9% on MathVista, 62.9% $\rightarrow$ 68.7% on We-Math), using standard dataset without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
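
The self-rewarding idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the MM-UPT repository: the function names `majority_vote_rewards` and `group_relative_advantages`, the binary 1.0/0.0 reward, the first-seen tie-breaking, and the use of the population standard deviation are all assumptions made for illustration. The sketch assumes each sampled response has already been reduced to a final answer string, and it pairs the majority-vote pseudo-label with the group-normalized advantage commonly associated with GRPO.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Assign 1.0 to answers that match the most common answer, else 0.0.

    Hypothetical helper: the majority answer acts as a pseudo-label, so no
    ground-truth label is needed. Ties go to the answer seen first.
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples (GRPO-style).

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled final answers to the same question (made-up values).
answers = ["12", "12", "15", "12"]
rewards = majority_vote_rewards(answers)
advantages = group_relative_advantages(rewards)
```

In this sketch the three responses that agree with the majority receive a positive advantage and the dissenting one a negative advantage. When every sample agrees, all advantages are zero, so such a group contributes no learning signal.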

arXiv information

Authors	Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Published	2025-05-28 15:11:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
