Spurious Rewards: Rethinking Training Signals in RLVR

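Summary

Reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even when the rewards are spurious, that is, when they have little, no, or even negative correlation with the correct answer.
For example, RLVR improves the MATH-500 performance of Qwen2.5-Math-7B, in absolute points, by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect labels), 26.0% (1-shot RL), and 27.1% (majority voting), nearly matching the 29.1% gained with ground-truth rewards.
However, the spurious rewards that work for Qwen often fail to yield gains for other model families such as Llama3 and OLMo2.
In particular, the authors find that code reasoning (thinking in code without actually executing it) is a distinctive Qwen2.5-Math behavior that becomes markedly more frequent after RLVR, rising from 65% to over 90%, even with spurious rewards.
Overall, given the lack of a useful reward signal, they hypothesize that RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work.
Because significant gains are easy to obtain on Qwen models even with completely spurious reward signals, they suggest that future RLVR research be validated on diverse models rather than on a single de facto choice.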

Abstract (original)

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) — nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning — thinking in code without actual code execution — to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
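The reward types named in the abstract can be illustrated with a small sketch. The abstract does not define them precisely, so everything below is an assumption made for illustration: the helper names, the choice of a final `\boxed{...}` expression as the answer, the 0/1 reward scale, and the reward probability `p=0.5` are all hypothetical and are not taken from the authors' implementation.

```python
import random
import re
from collections import Counter


def extract_boxed(response):
    """Return the last non-empty \\boxed{...} content in a response, or None.

    Assumption: answers are marked with \\boxed{...} and contain no nested braces.
    """
    found = [m.strip() for m in re.findall(r"\\boxed\{([^{}]+)\}", response)]
    found = [m for m in found if m]
    return found[-1] if found else None


def ground_truth_reward(response, label):
    """Standard verifiable reward: 1.0 if the extracted answer equals the true label."""
    return 1.0 if extract_boxed(response) == label else 0.0


def random_reward(response, rng, p=0.5):
    """Spurious reward that ignores the response entirely (p is a hypothetical choice)."""
    return 1.0 if rng.random() < p else 0.0


def format_reward(response):
    """Spurious reward for merely producing a boxed answer, right or wrong."""
    return 1.0 if extract_boxed(response) is not None else 0.0


def incorrect_label_reward(response, wrong_label):
    """Spurious reward for matching a label that is known to be wrong."""
    return 1.0 if extract_boxed(response) == wrong_label else 0.0


def majority_vote_label(responses):
    """Pseudo-label: the most common extracted answer among sampled responses."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

Under these assumptions, only `ground_truth_reward` depends on the correct answer. The other functions reward the response for reasons unrelated to correctness, or on the basis of a label that may be wrong, which is the sense in which the abstract calls such signals spurious.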

arXiv information

Authors	Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
Published	2025-06-12 17:49:55+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
