Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Summary

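We propose a new algorithm for fine-tuning large language models with reinforcement learning.
Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while keeping the learning dynamics stable, even without KL regularization.
TOPR can be applied fully offline, handles positive and negative examples in a single unified framework, and benefits from the implementation simplicity typical of Monte Carlo algorithms.
A series of experiments on the GSM8K and MATH reasoning benchmarks demonstrates the effectiveness of the approach, with performance gains both when training a model for solution generation and when training a generative verifier.
We show that properly leveraging positive and negative examples alike in the off-policy regime improves test-time accuracy and training data efficiency at the same time, while avoiding the "wasted inference" that comes with discarding negative examples.
This advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, allowing 8B language models to match the performance of 70B-parameter models.
As a corollary of this work, we find that the baseline parameter of REINFORCE plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.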

Summary (original)

We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the “wasted inference” that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE’s baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
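
The abstract names an "asymmetric, tapered variant of importance sampling" but does not give its formula. The sketch below is one possible reading, not the authors' stated method: it assumes that positive examples receive no importance correction, while negative examples are weighted by the ratio of policy to behavior probability, clipped to at most 1. The function name `tapered_weight`, its parameters, and the clipping bound are all hypothetical choices made for illustration.

```python
import math


def tapered_weight(logp_policy, logp_behavior, reward, baseline=0.0):
    """Scalar that would multiply grad log pi(y|x) for one sampled response.

    Assumption (not taken from the abstract): positive examples are left
    uncorrected, negative examples use an importance ratio clipped to [0, 1].
    """
    advantage = reward - baseline
    if advantage >= 0:
        # Positive example: plain REINFORCE-style weight, no importance ratio.
        return advantage
    # Negative example: ratio pi(y|x) / mu(y|x), tapered so it never exceeds 1.
    ratio = math.exp(logp_policy - logp_behavior)
    return advantage * min(ratio, 1.0)


# A positive example keeps its full weight regardless of the ratio.
w_pos = tapered_weight(logp_policy=-1.0, logp_behavior=-2.0, reward=1.0)

# A negative example the policy has already moved away from is down-weighted.
w_neg = tapered_weight(logp_policy=-3.0, logp_behavior=-2.0, reward=-1.0)
```

Under this reading, the `baseline` argument decides which samples fall on the negative side: raising it turns some positive-reward samples into effective negatives, which is one way to picture the abstract's remark that the baseline shapes dataset composition.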

arXiv information

Authors	Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Sam Work
Published	2025-03-19 14:25:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
