Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

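Summary

We introduce Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V.
At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO). This harmonizes reward-model guidance with rule-based strategies, addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization.
To further improve training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which counters the "Vanishing Advantages" dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process.
Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout training.
Empirical results confirm R1V2's capability, with benchmark-leading performance such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU.
These results underscore R1V2's superiority over existing open-source models and show significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini.
The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B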
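
The "Vanishing Advantages" issue can be illustrated with a small sketch. It assumes the commonly used GRPO formulation in which each response's advantage is its reward standardized within its sampling group; under that assumption, a group whose responses all receive the same reward yields all-zero advantages and contributes no learning signal. The `SelectiveSampleBuffer` class below, along with its `capacity` and `threshold` parameters, is a hypothetical illustration of "prioritizing high-value samples" and is not the authors' implementation, which the abstract does not describe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical buffer that keeps only samples with non-zero advantage.

    Assumption for illustration: "high-value" means a large |advantage|.
    """

    def __init__(self, capacity=1024, threshold=1e-6):
        self.capacity = capacity
        self.threshold = threshold
        self.items = []  # entries: (abs_advantage, prompt, response, advantage)

    def add_group(self, prompt, responses, rewards):
        """Score one group, store its informative samples, return how many were kept."""
        advantages = group_relative_advantages(rewards)
        kept = 0
        for response, adv in zip(responses, advantages):
            if abs(adv) > self.threshold:
                self.items.append((abs(adv), prompt, response, adv))
                kept += 1
        # Retain the samples with the largest absolute advantage.
        self.items.sort(key=lambda item: item[0], reverse=True)
        del self.items[self.capacity:]
        return kept


buffer = SelectiveSampleBuffer(capacity=8)

# Every response is correct: all advantages vanish, nothing is stored.
kept_uniform = buffer.add_group("q1", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: every response carries a non-zero advantage and is stored.
kept_mixed = buffer.add_group("q2", ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 1.0])
```

In this sketch the uniform group adds nothing to the buffer while the mixed group adds all four responses, so a trainer drawing batches from the buffer would see only samples that produce a gradient.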

Summary (original)

We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages the Mixed Preference Optimization (MPO) and the Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which effectively counters the “Vanishing Advantages” dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations–a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2’s superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility https://huggingface.co/Skywork/Skywork-R1V2-38B.

arXiv information

Authors	Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
Published	2025-04-25 15:28:34+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
