On-Policy RL with Optimal Reward Baseline

Summary

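Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities.
However, current reinforcement learning algorithms often suffer from training instability caused by loose on-policy constraints, and from computational inefficiency caused by auxiliary models.
This work proposes On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges.
OPO emphasizes the importance of exact on-policy training, which empirically stabilizes training and enhances exploration.
In addition, OPO introduces an optimal reward baseline that theoretically minimizes gradient variance.
The authors evaluate OPO on mathematical reasoning benchmarks.
The results show superior performance and training stability without additional models or regularization terms.
OPO also achieves lower policy shift and higher output entropy, which encourages more diverse and less repetitive responses.
These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks.
The implementation is available at https://github.com/microsoft/LMOps/tree/main/opo.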
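
The summary says the baseline minimizes gradient variance but does not give its form. For intuition, a standard result for score-function (REINFORCE-style) estimators is that the scalar baseline minimizing gradient variance is a weighted mean of rewards, b* = E[ ||grad log pi(y|x)||^2 * R(y) ] / E[ ||grad log pi(y|x)||^2 ]. The sketch below computes that weighted mean for a batch of sampled responses. It is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, and the `weights` argument stands in for the squared gradient norms, which in practice would have to be computed or approximated by some proxy (response length is one assumed possibility).

```python
from typing import List, Sequence


def variance_minimizing_baseline(rewards: Sequence[float],
                                 weights: Sequence[float]) -> float:
    """Weighted mean of rewards.

    `weights` are assumed stand-ins for ||grad log pi||^2 per sample
    (hypothetical proxy; not taken from the source text).
    """
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, rewards)) / total


def advantages(rewards: Sequence[float],
               weights: Sequence[float]) -> List[float]:
    """Subtract the shared baseline from each sample's reward."""
    baseline = variance_minimizing_baseline(rewards, weights)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Three sampled responses to one prompt, with binary correctness rewards.
    rewards = [1.0, 0.0, 1.0]
    # Assumed proxy weights (for example, response lengths in tokens).
    weights = [10.0, 30.0, 20.0]
    print(variance_minimizing_baseline(rewards, weights))
    print(advantages(rewards, weights))
```

With uniform weights the baseline reduces to the plain mean reward of the batch, so the weighting only matters when samples differ in their assumed gradient magnitude.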

Abstract (original)

Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.

arXiv information

Authors	Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
Published	2025-05-29 15:58:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
