Soft Policy Optimization: Online Off-Policy RL for Sequence Models


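This paper introduces Soft Policy Optimization (SPO), a simple, scalable, and principled soft RL method for sequence model policies that learns from arbitrary online and offline trajectories and does not require a separate value model.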


RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we show that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.


Authors: Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve
Published: 2025-03-07 14:23:40+00:00

Categories: cs.AI, cs.LG