MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

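Summary

Recent text-to-image systems are limited in handling multimodal inputs and complex reasoning tasks.
We introduce MindOmni, a unified multimodal large language model that addresses these limitations by incorporating reasoning generation through reinforcement learning.
MindOmni uses a three-phase training strategy: i) designing a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) the proposed Reasoning Generation Policy Optimization (RGPO) algorithm, which uses multimodal feedback to guide policy updates.
Experiments show that MindOmni outperforms existing models on both understanding and generation benchmarks while demonstrating fine-grained reasoning generation capabilities, especially on mathematical reasoning instructions.
All code will be made public at https://github.com/TencentARC/MindOmni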

Abstract (original)

Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, meanwhile showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instruction. All codes will be made public at https://github.com/TencentARC/MindOmni
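The abstract says only that RGPO uses multimodal feedback to guide policy updates; it gives no formula. As a rough illustration of what that could look like, the sketch below blends two feedback signals into a scalar reward and normalizes rewards within a group of samples for the same prompt. The group-relative normalization is a swapped-in, GRPO-style technique assumed here for illustration, not the paper's stated method, and the function names, the 0.5 weighting, and the sample scores are all hypothetical.

```python
from __future__ import annotations

from statistics import mean, pstdev


def combine_feedback(image_score: float, text_score: float,
                     image_weight: float = 0.5) -> float:
    """Blend two hypothetical feedback signals into one scalar reward.

    Assumption: both scores are already scaled to a comparable range.
    """
    return image_weight * image_score + (1.0 - image_weight) * text_score


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: GRPO-style normalization (subtract the group mean, divide
    by the group standard deviation); the abstract does not confirm this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt: (image_score, text_score).
samples = [(0.9, 0.8), (0.4, 0.7), (0.6, 0.2), (0.1, 0.3)]
rewards = [combine_feedback(i, t) for i, t in samples]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive positive advantages and the rest negative ones, so a policy update would favor the former without needing a separate value model.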

arXiv information

Authors	Yicheng Xiao, Lin Song, Yukang Chen, Yingmin Luo, Yuxin Chen, Yukang Gan, Wei Huang, Xiu Li, Xiaojuan Qi, Ying Shan
Published	2025-06-11 15:44:25+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
