OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Summary

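Recent progress demonstrated by DeepSeek-R1 shows that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME.
Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks.
To further improve model generalization, we consider an approach that iteratively applies supervised fine-tuning (SFT) on lightweight training data and reinforcement learning (RL).
Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps from high-quality captions of images sourced from diverse visual datasets.
Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating a refined SFT dataset for the next round.
This iterative process yielded OpenVLThinker, an LVLM that shows consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of this strategy for robust vision-language reasoning.
The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.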

Summary (original)

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration’s RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
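
The iterative procedure described in the abstract can be outlined as a loop. The following is a minimal Python sketch of that control flow only, not the authors' implementation. The function names (`build_sft_data`, `iterative_self_improvement`), the callables passed in, and the toy data are all hypothetical. Two details are assumptions that the abstract does not state: generated traces are kept only when their final answer matches a reference, and each round continues training from the previous round's model.

```python
from typing import Callable, List, Tuple, TypeVar

M = TypeVar("M")                                  # any model/checkpoint type
Problem = Tuple[str, str]                         # (question, reference answer)
Trace = Tuple[str, str]                           # (question, reasoning text)
AnswerFn = Callable[[str], Tuple[str, str]]       # question -> (reasoning, final answer)


def build_sft_data(answer_fn: AnswerFn, problems: List[Problem]) -> List[Trace]:
    """Collect reasoning traces, keeping those whose answer matches the reference.

    The correctness filter is an assumption made for this sketch.
    """
    data = []
    for question, reference in problems:
        reasoning, answer = answer_fn(question)
        if answer == reference:
            data.append((question, reasoning))
    return data


def iterative_self_improvement(
    model: M,
    problems: List[Problem],
    teacher_fn: AnswerFn,                          # stands in for the text-only reasoner
    sft_step: Callable[[M, List[Trace]], M],       # hypothetical SFT routine
    rl_step: Callable[[M, List[Problem]], M],      # hypothetical RL routine
    as_answer_fn: Callable[[M], AnswerFn],         # wraps a model as a generator
    rounds: int,
) -> M:
    # Round 0 data: traces distilled from the teacher.
    sft_data = build_sft_data(teacher_fn, problems)
    for _ in range(rounds):
        model = sft_step(model, sft_data)          # supervised fine-tuning
        model = rl_step(model, problems)           # reinforcement learning
        # The RL-improved model produces the SFT data for the next round.
        sft_data = build_sft_data(as_answer_fn(model), problems)
    return model


# Toy run: the "model" is just a log of the training steps applied to it.
toy_problems = [("1+1", "2"), ("2+2", "4")]
gold = dict(toy_problems)
toy_log = iterative_self_improvement(
    model=[],
    problems=toy_problems,
    teacher_fn=lambda q: ("teacher reasoning", "2"),     # right on one problem only
    sft_step=lambda m, d: m + [f"sft({len(d)})"],
    rl_step=lambda m, p: m + ["rl"],
    as_answer_fn=lambda m: (lambda q: ("model reasoning", gold[q])),
    rounds=2,
)
print(toy_log)  # ['sft(1)', 'rl', 'sft(2)', 'rl']
```

In the toy run the teacher answers only one problem correctly, so the first SFT round sees one trace. The stand-in model answers both correctly, so the second round sees two. This only illustrates how the dataset is regenerated between rounds.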

arXiv information

Authors	Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Published	2025-03-21 17:52:43+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
