Learning to Reason under Off-Policy Guidance

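Summary

Recent advances in large reasoning models (LRMs) show that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge through reinforcement learning (RL) with simple rule-based rewards.
However, existing zero-RL approaches are inherently "on-policy": they restrict learning to the model's own outputs, so the model cannot acquire reasoning abilities beyond its initial capabilities.
The authors introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces.
LUFFY balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training.
In particular, the authors propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training.
LUFFY achieves an average gain of over +7.0 across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks.
It also substantially outperforms imitation-based supervised fine-tuning (SFT), particularly in generalization.
The analysis shows that LUFFY not only imitates effectively but also explores beyond the demonstrations, offering a scalable path to training generalizable reasoning models with off-policy guidance.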

Abstract (original)

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently “on-policy”, limiting learning to a model’s own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an over +7.0 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
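The mixed-policy idea in the abstract can be illustrated with a small sketch. The abstract does not give the form of the shaping function or of the advantage estimate, so everything below is an assumption made for illustration: advantages are normalized over the combined group of on-policy and off-policy samples for one prompt, off-policy tokens are weighted by a hypothetical shaping function f(p) = p / (p + gamma) applied to the current policy's token probability, and all function and field names are invented here.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one prompt's group: (r - mean) / std.

    Assumption: the group mixes on-policy rollouts and off-policy traces.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def shape(prob, gamma=0.1):
    """Hypothetical shaping function f(p) = p / (p + gamma)."""
    return prob / (prob + gamma)


def shape_grad(prob, gamma=0.1):
    """Derivative of shape() with respect to prob: gamma / (p + gamma)^2."""
    return gamma / (prob + gamma) ** 2


def mixed_objective(on_policy, off_policy, gamma=0.1):
    """Token-averaged surrogate objective over a mixed group.

    Each trajectory is a dict with:
      "reward":  scalar rule-based reward
      "weights": per-token values; importance ratios for on-policy
                 rollouts, current-policy token probabilities for
                 off-policy traces (assumed layout).
    """
    trajectories = on_policy + off_policy
    advantages = group_advantages([t["reward"] for t in trajectories])
    total, count = 0.0, 0
    for i, traj in enumerate(trajectories):
        is_off_policy = i >= len(on_policy)
        for w in traj["weights"]:
            weight = shape(w, gamma) if is_off_policy else w
            total += weight * advantages[i]
            count += 1
    return total / count if count else 0.0


on = [{"reward": 0.0, "weights": [1.0, 1.0]}]   # failed on-policy rollout
off = [{"reward": 1.0, "weights": [0.1, 0.9]}]  # correct off-policy trace
objective = mixed_objective(on, off)
```

Under this assumed form, the gradient of the shaped weight is largest for tokens the current policy finds unlikely (shape_grad(0.01) is far larger than shape_grad(0.9)), which is one way a shaping function could keep the model from imitating only the tokens it already prefers.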

arXiv information

Authors	Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
Published	2025-04-21 08:09:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
