Stable Reinforcement Learning for Efficient Reasoning

Summary

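The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods such as GRPO.
However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking.
In response, recent studies have designed reward functions that reinforce the model's behavior of producing shorter yet correct completions.
Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy collapses abruptly, often early in training.
To address this issue, we propose GRPO-$\lambda$, a simple yet effective solution and an efficient, stabilized variant of GRPO. It dynamically adjusts the reward strategy by monitoring the correctness ratio among the completions within each query's sampled group.
A low correctness ratio signals that the length penalty, which compromises CoT quality, should be avoided, and triggers a switch to length-agnostic 0/1 rewards that prioritize reasoning capability.
A high ratio keeps the length penalty in place to improve efficiency.
Experimental results show that our approach avoids the training instability caused by length penalties while maintaining the optimal accuracy-efficiency trade-off.
On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.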
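
The switching rule described in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives neither the threshold nor the exact form of the length penalty, so the function name `group_rewards`, the parameters `ratio_threshold` and `penalty_weight`, and the linear penalty used here are all assumptions made for illustration.

```python
from typing import List


def group_rewards(
    correct: List[bool],
    lengths: List[int],
    ratio_threshold: float = 0.5,  # assumed value, not from the paper
    penalty_weight: float = 0.5,   # assumed value, not from the paper
) -> List[float]:
    """Rewards for the completions sampled for one query.

    correct[i] says whether completion i is right; lengths[i] is its
    length in tokens. Both lists describe the same sampled group.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    # Correctness ratio among the completions in this group.
    ratio = sum(correct) / len(correct)

    # Plain rule-based 0/1 outcome reward.
    base = [1.0 if c else 0.0 for c in correct]

    if ratio < ratio_threshold:
        # Low ratio: fall back to length-agnostic 0/1 rewards.
        return base

    # High ratio: keep a length penalty so that shorter correct
    # completions score higher. The linear form is an assumption.
    max_len = max(max(lengths), 1)
    return [b * (1.0 - penalty_weight * n / max_len) for b, n in zip(base, lengths)]


# A hard query (1 of 4 correct) gets plain 0/1 rewards.
print(group_rewards([True, False, False, False], [100, 200, 300, 400]))
# An easy query (3 of 4 correct) gets length-penalized rewards.
print(group_rewards([True, True, True, False], [100, 200, 400, 400]))
```

In this sketch the decision is made per query group, so hard queries keep the full 0/1 signal while easy queries are still pushed toward shorter answers.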

Abstract (original)

The success of Deepseek-R1 has drawn the LLM community’s attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods lack the capability to regulate the intermediate reasoning processes during chain-of-thought (CoT) generation, leading to severe overthinking phenomena. In response, recent studies have designed reward functions to reinforce models’ behaviors in producing shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as the completion length decreases, model accuracy abruptly collapses, often occurring early in training. To address this issue, we propose a simple yet effective solution GRPO-$\lambda$, an efficient and stabilized variant of GRPO, which dynamically adjusts the reward strategy by monitoring the correctness ratio among completions within each query-sampled group. A low correctness ratio indicates the need to avoid length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids training instability caused by length penalty while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.

arXiv information

Authors	Muzhi Dai, Shixuan Liu, Qingyi Si
Published	2025-05-23 16:43:03+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
