Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Summary

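We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V.
At its core, R1V2 introduces a hybrid reinforcement learning paradigm that combines reward-model guidance with rule-based strategies, addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization.
To further improve training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which counters the "Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process.
Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process.
Empirical results confirm the strong capability of R1V2, with benchmark-leading scores of 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU.
These results underscore R1V2's advantage over existing open-source models and show significant progress in closing the performance gap with leading proprietary systems, including Gemini 2.5 and OpenAI o4-mini.
The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.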
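
The "Vanishing Advantages" issue named in the summary can be illustrated with a small sketch. GRPO standardizes each reward against the other responses sampled for the same prompt, so a group whose rewards are all identical yields zero advantages and no learning signal. The summary gives no algorithmic details of the SSB, so the buffer below is only an assumption about what "prioritizing high-value samples" might look like. The class name, the `min_abs_adv` threshold, the capacity, and the sampling weighted by advantage magnitude are all hypothetical, and this is not the authors' implementation.

```python
import random
import statistics
from collections import deque


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical buffer that keeps only samples with a usable advantage."""

    def __init__(self, capacity=1024, min_abs_adv=1e-3):
        # Oldest entries are dropped once capacity is reached (assumption).
        self.items = deque(maxlen=capacity)
        self.min_abs_adv = min_abs_adv

    def add_group(self, prompt, responses, rewards):
        """Store responses whose |advantage| clears the threshold; return the count kept."""
        advantages = group_relative_advantages(rewards)
        kept = 0
        for response, adv in zip(responses, advantages):
            if abs(adv) >= self.min_abs_adv:
                self.items.append((prompt, response, adv))
                kept += 1
        return kept

    def sample(self, k, rng=random):
        """Draw k stored samples with replacement, weighted by |advantage|."""
        if not self.items:
            return []
        weights = [abs(adv) for _, _, adv in self.items]
        return rng.choices(list(self.items), weights=weights, k=k)


if __name__ == "__main__":
    buf = SelectiveSampleBuffer()
    # All responses equally rewarded: every advantage is zero, nothing is kept.
    print(buf.add_group("q1", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0]))
    # Mixed rewards: every response carries a non-zero advantage and is kept.
    print(buf.add_group("q2", ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 1.0]))
    print(len(buf.sample(3)))
```

In this sketch a uniformly rewarded group contributes nothing to the buffer, while a mixed group contributes all of its responses, so later batches drawn from the buffer consist only of samples with a non-zero gradient signal.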

Abstract (original)

We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the “Vanishing Advantages” dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations–a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2’s superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility https://huggingface.co/Skywork/Skywork-R1V2-38B.

arXiv information

Authors	Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
Published	2025-04-23 12:24:10+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
